Richard Elling
2009-Sep-28 16:58 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Sep 28, 2009, at 2:41 PM, Albert Chin wrote:

> Without doing a zpool scrub, what's the quickest way to find files in a
> filesystem with cksum errors? Iterating over all files with "find" takes
> quite a bit of time. Maybe there's some zdb fu that will perform the
> check for me?

Scrub could be faster, but you can try

   tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.
 -- richard
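A minimal illustration of this approach ("tank" and /tank/myfs are placeholder names): after the read pass, zpool status -v reports any files in which persistent checksum errors were found.

   cd /tank/myfs
   tar cf - . > /dev/null   # force ZFS to read (and checksum-verify) every block of every file
   zpool status -v tank     # -v lists files affected by persistent errors, if any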
Bob Friesenhahn
2009-Sep-28 17:09 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Mon, 28 Sep 2009, Richard Elling wrote:

> Scrub could be faster, but you can try
>    tar cf - . > /dev/null
>
> If you think about it, validating checksums requires reading the data.
> So you simply need to read the data.

This should work but it does not verify the redundant metadata. For
example, the duplicate metadata copy might be corrupt but the problem
is not detected since it did not happen to be used.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Richard Elling
2009-Sep-28 17:16 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:

> On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:
>> On Mon, 28 Sep 2009, Richard Elling wrote:
>>>
>>> Scrub could be faster, but you can try
>>>    tar cf - . > /dev/null
>>>
>>> If you think about it, validating checksums requires reading the data.
>>> So you simply need to read the data.
>>
>> This should work but it does not verify the redundant metadata. For
>> example, the duplicate metadata copy might be corrupt but the problem
>> is not detected since it did not happen to be used.
>
> Too bad we cannot scrub a dataset/object.

Can you provide a use case? I don't see why scrub couldn't start and
stop at specific txgs, for instance. That won't necessarily get you to a
specific file, though.
 -- richard
Tim Cook
2009-Sep-28 17:22 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Mon, Sep 28, 2009 at 12:16 PM, Richard Elling <richard.elling at gmail.com> wrote:

> On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:
>
>> On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:
>>> On Mon, 28 Sep 2009, Richard Elling wrote:
>>>>
>>>> Scrub could be faster, but you can try
>>>>    tar cf - . > /dev/null
>>>>
>>>> If you think about it, validating checksums requires reading the data.
>>>> So you simply need to read the data.
>>>
>>> This should work but it does not verify the redundant metadata. For
>>> example, the duplicate metadata copy might be corrupt but the problem
>>> is not detected since it did not happen to be used.
>>
>> Too bad we cannot scrub a dataset/object.
>
> Can you provide a use case? I don't see why scrub couldn't start and
> stop at specific txgs, for instance. That won't necessarily get you to a
> specific file, though.
> -- richard

I get the impression he just wants to check a single file in a pool
without waiting for it to check the entire pool.

--Tim
Bob Friesenhahn
2009-Sep-28 17:25 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Mon, 28 Sep 2009, Bob Friesenhahn wrote:

> This should work but it does not verify the redundant metadata. For
> example, the duplicate metadata copy might be corrupt but the problem
> is not detected since it did not happen to be used.

I am finding that your tar incantation is reading hardly any data from
disk when testing my home directory, and the 'tar' happens to be GNU tar:

   # time tar cf - . > /dev/null
   tar cf - . > /dev/null  2.72s user 12.43s system 96% cpu 15.721 total
   # du -sh .
   82G

Looks like the GNU folks slipped in a small performance "enhancement"
when the output is /dev/null. Make sure to use /bin/tar, which does
actually read the data.

When actually reading the data via tar, read performance is very poor.
Hopefully I will have a ZFS IDR to test with in the next few days which
fixes the prefetch bug. Zpool scrub reads the data at 360MB/second but
this tar method is only reading at an average of 6MB/second to
42MB/second (according to zpool iostat). Whoops, I just saw a one-minute
average of 105MB/second and then 131MB/second. Quite variable.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
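The behavior Bob describes is documented GNU tar behavior: when the archive is written directly to /dev/null, GNU tar minimizes its I/O and may not read file contents at all. If only GNU tar is available, a common workaround (a sketch, with a placeholder path) is to pipe through cat so tar no longer sees /dev/null as its destination:

   cd /tank/myfs
   tar cf - . | cat > /dev/null   # defeats the /dev/null shortcut; tar now reads every file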
Victor Latushkin
2009-Sep-28 17:31 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
Richard Elling wrote:

> On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:
>
>> On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:
>>> On Mon, 28 Sep 2009, Richard Elling wrote:
>>>>
>>>> Scrub could be faster, but you can try
>>>>    tar cf - . > /dev/null
>>>>
>>>> If you think about it, validating checksums requires reading the data.
>>>> So you simply need to read the data.
>>>
>>> This should work but it does not verify the redundant metadata. For
>>> example, the duplicate metadata copy might be corrupt but the problem
>>> is not detected since it did not happen to be used.
>>
>> Too bad we cannot scrub a dataset/object.
>
> Can you provide a use case? I don't see why scrub couldn't start and
> stop at specific txgs, for instance. That won't necessarily get you to a
> specific file, though.

With ever-increasing disk and pool sizes it takes more and more time for
scrub to complete its job. Let's imagine that you have a 100TB pool with
90TB of data in it, and there's a dataset with 10TB that is critical and
another dataset with 80TB that is not that critical, where you can
afford losing some blocks/files.

So being able to scrub an individual dataset would help to run scrubs of
critical data more frequently and faster, and to schedule scrubs for
less frequently used and/or less important data to happen much less
frequently.

It may be useful to have a way to tell ZFS to scrub pool-wide metadata
only (space maps etc.), so that you can build your own schedule of
scrubs.

Another interesting idea is to be able to scrub only blocks modified
since the last snapshot.

victor
Richard Elling
2009-Sep-28 18:01 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Sep 28, 2009, at 10:31 AM, Victor Latushkin wrote:

> Richard Elling wrote:
>> On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:
>>> On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:
>>>> On Mon, 28 Sep 2009, Richard Elling wrote:
>>>>>
>>>>> Scrub could be faster, but you can try
>>>>>    tar cf - . > /dev/null
>>>>>
>>>>> If you think about it, validating checksums requires reading the data.
>>>>> So you simply need to read the data.
>>>>
>>>> This should work but it does not verify the redundant metadata. For
>>>> example, the duplicate metadata copy might be corrupt but the problem
>>>> is not detected since it did not happen to be used.
>>>
>>> Too bad we cannot scrub a dataset/object.
>> Can you provide a use case? I don't see why scrub couldn't start and
>> stop at specific txgs, for instance. That won't necessarily get you to a
>> specific file, though.
>
> With ever-increasing disk and pool sizes it takes more and more time
> for scrub to complete its job. Let's imagine that you have a 100TB
> pool with 90TB of data in it, and there's a dataset with 10TB that is
> critical and another dataset with 80TB that is not that critical, where
> you can afford losing some blocks/files.

Personally, I have three concerns here.

1. Gratuitous complexity, especially inside a pool -- aka creeping
   featurism.
2. Wouldn't a better practice be to use two pools with different
   protection policies? The only protection-policy difference inside a
   pool is copies. In other words, I am concerned that people replace
   good data protection practices with scrubs and expect scrub to
   deliver better data protection (it won't).
3. Since the pool contains the set of blocks, shared by datasets, it is
   not clear to me that scrubbing a dataset will detect all of the data
   corruption failures which can affect the dataset. I'm thinking along
   the lines of phantom writes, for example.
4. The time it takes to scrub lots of stuff.

...there are four concerns... :-)

For magnetic media, a yearly scrub interval should suffice for most
folks. I know some folks who scrub monthly. More frequent scrubs won't
buy much.

Scrubs are also useful for detecting broken hardware. However, normal
activity will also detect broken hardware, so it is better to think of
scrubs as finding degradation of old data rather than as a hardware
checking service.

> So being able to scrub individual datasets would help to run scrubs
> of critical data more frequently and faster, and schedule scrubs for
> less frequently used and/or less important data to happen much less
> frequently.
>
> It may be useful to have a way to tell ZFS to scrub pool-wide
> metadata only (space maps etc.), so that you can build your own
> schedule of scrubs.
>
> Another interesting idea is to be able to scrub only blocks modified
> since the last snapshot.

This can be relatively easy to implement. But remember that scrubs are
most useful for finding data which has degraded on the media -- in other
words, old data. New data is not likely to have degraded yet, and since
ZFS is COW, all of the new data is, well, new. This is why having the
ability to bound the start and end of a scrub by txg can be easy and
perhaps useful.
 -- richard
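For those who do scrub on a schedule, this is typically just a cron job; a minimal sketch (the pool name "tank" and the monthly schedule are only examples):

   # root crontab entry: scrub the pool at 03:00 on the 1st of each month
   0 3 1 * * /usr/sbin/zpool scrub tank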
Victor Latushkin
2009-Sep-28 18:28 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On 28.09.09 22:01, Richard Elling wrote:

> On Sep 28, 2009, at 10:31 AM, Victor Latushkin wrote:
>
>> With ever-increasing disk and pool sizes it takes more and more time
>> for scrub to complete its job. Let's imagine that you have a 100TB pool
>> with 90TB of data in it, and there's a dataset with 10TB that is
>> critical and another dataset with 80TB that is not that critical, where
>> you can afford losing some blocks/files.
>
> Personally, I have three concerns here.
> 1. Gratuitous complexity, especially inside a pool -- aka creeping
>    featurism.

There's the idea of priority-based resilvering (though not implemented
yet, see http://blogs.sun.com/bonwick/en_US/entry/smokin_mirrors) that
could simply be extended to scrubs as well.

> 2. Wouldn't a better practice be to use two pools with different
>    protection policies? The only protection-policy difference inside a
>    pool is copies. In other words, I am concerned that people replace
>    good data protection practices with scrubs and expect scrub to
>    deliver better data protection (it won't).

It may be better, it may not... With two pools you split your bandwidth,
IOPS and space, and have more entities to care about...

> 3. Since the pool contains the set of blocks, shared by datasets, it is
>    not clear to me that scrubbing a dataset will detect all of the data
>    corruption failures which can affect the dataset. I'm thinking along
>    the lines of phantom writes, for example.

That is why it may be useful to always scrub pool-wide metadata, or to
have a way to specifically request it.

> 4. The time it takes to scrub lots of stuff.
> ...there are four concerns... :-)
>
> For magnetic media, a yearly scrub interval should suffice for most
> folks. I know some folks who scrub monthly. More frequent scrubs won't
> buy much.

It won't buy you much in terms of magnetic media decay discovery.
Unfortunately, there are other sources of corruption as well (including
the phantom writes you are thinking about), and being able to discover
corruption and recover it as quickly as possible from backup is a good
thing.

> Scrubs are also useful for detecting broken hardware. However, normal
> activity will also detect broken hardware, so it is better to think of
> scrubs as finding degradation of old data rather than as a hardware
> checking service.
>
>> So being able to scrub individual datasets would help to run scrubs of
>> critical data more frequently and faster, and schedule scrubs for less
>> frequently used and/or less important data to happen much less
>> frequently.
>>
>> It may be useful to have a way to tell ZFS to scrub pool-wide metadata
>> only (space maps etc.), so that you can build your own schedule of
>> scrubs.
>>
>> Another interesting idea is to be able to scrub only blocks modified
>> since the last snapshot.
>
> This can be relatively easy to implement. But remember that scrubs are
> most useful for finding data which has degraded on the media -- in other
> words, old data. New data is not likely to have degraded yet, and since
> ZFS is COW, all of the new data is, well, new.
>
> This is why having the ability to bound the start and end of a scrub by
> txg can be easy and perhaps useful.

This requires exporting the concept of transaction group numbers to the
user, and I do not see how it is less complex from the user-interface
perspective than being able to request a scrub of an individual dataset,
pool-wide metadata, or newly-written data.

regards,
victor
Bob Friesenhahn
2009-Sep-28 18:41 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Mon, 28 Sep 2009, Richard Elling wrote:

> In other words, I am concerned that people replace good data protection
> practices with scrubs and expect scrub to deliver better data
> protection (it won't).

Many people here would profoundly disagree with the above. There is no
substitute for good backups, but a periodic scrub helps validate that a
later resilver would succeed. A periodic scrub also helps find system
problems early, when they are less likely to crater your business. It is
much better to find an issue during a scrub rather than during resilver
of a mirror or raidz.

> Scrubs are also useful for detecting broken hardware. However,
> normal activity will also detect broken hardware, so it is better to
> think of scrubs as finding degradation of old data rather than being
> a hardware checking service.

Do you have a scientific reference for this notion that "old data" is
more likely to be corrupt than "new data", or is it just a gut feeling?
This hypothesis does not sound very supportable to me. Magnetic
hysteresis lasts quite a lot longer than the recommended service life
for a hard drive. Studio audio tapes from the '60s are still being used
to produce modern "remasters" of old audio recordings which sound better
than they ever did before (other than the master tape). Some forms of
magnetic hysteresis are known to last millions of years. Media failure
is more often than not mechanical or chemical and not related to loss of
magnetic hysteresis. Head failures may be construed to be media
failures.

See http://en.wikipedia.org/wiki/Ferromagnetic for information on
ferromagnetic materials.

It would be most useful if zfs incorporated a slow-scan scrub which
validates data at a low rate of speed that does not hinder active I/O.
Of course this is not a "green", energy-efficient solution.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Albert Chin
2009-Sep-28 21:41 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
Without doing a zpool scrub, what's the quickest way to find files in a
filesystem with cksum errors? Iterating over all files with "find" takes
quite a bit of time. Maybe there's some zdb fu that will perform the
check for me?

--
albert chin (china at thewrittenword.com)
Albert Chin
2009-Sep-28 22:42 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:

> On Mon, 28 Sep 2009, Richard Elling wrote:
>>
>> Scrub could be faster, but you can try
>>    tar cf - . > /dev/null
>>
>> If you think about it, validating checksums requires reading the data.
>> So you simply need to read the data.
>
> This should work but it does not verify the redundant metadata. For
> example, the duplicate metadata copy might be corrupt but the problem
> is not detected since it did not happen to be used.

Too bad we cannot scrub a dataset/object.

--
albert chin (china at thewrittenword.com)
Albert Chin
2009-Sep-28 22:58 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Mon, Sep 28, 2009 at 10:16:20AM -0700, Richard Elling wrote:

> On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:
>
>> On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:
>>> On Mon, 28 Sep 2009, Richard Elling wrote:
>>>>
>>>> Scrub could be faster, but you can try
>>>>    tar cf - . > /dev/null
>>>>
>>>> If you think about it, validating checksums requires reading the data.
>>>> So you simply need to read the data.
>>>
>>> This should work but it does not verify the redundant metadata. For
>>> example, the duplicate metadata copy might be corrupt but the problem
>>> is not detected since it did not happen to be used.
>>
>> Too bad we cannot scrub a dataset/object.
>
> Can you provide a use case? I don't see why scrub couldn't start and
> stop at specific txgs, for instance. That won't necessarily get you to a
> specific file, though.

If your pool is borked but mostly readable, yet some file systems have
cksum errors, you cannot "zfs send" that file system (err, a snapshot of
the filesystem). So, you need to manually fix the file system by
traversing it to read all files and determine which must be fixed. Once
this is done, you can snapshot and "zfs send". If you have many file
systems, this is time consuming.

Of course, you could just rsync and be happy with what you were able to
recover, but if you have clones branched from the same parent, with only
a few differences between snapshots, having to rsync *everything* rather
than just the differences is painful. Hence the reason to try to get
"zfs send" to work.

But this is an extreme example, and I doubt pools are often in this
state, so the engineering time isn't worth it. In such cases, though, a
"zfs scrub" would be useful.

--
albert chin (china at thewrittenword.com)
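A rough sketch of the manual traversal step described above, assuming corrupt files surface as read (I/O) errors; the path is a placeholder:

   # read every file, logging the ones that cannot be read back
   find /tank/myfs -type f | while read -r f; do
       dd if="$f" of=/dev/null bs=1M 2>/dev/null || echo "unreadable: $f"
   done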
Richard Elling
2009-Sep-28 23:39 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Sep 28, 2009, at 11:41 AM, Bob Friesenhahn wrote:

> On Mon, 28 Sep 2009, Richard Elling wrote:
>
>> In other words, I am concerned that people replace good data protection
>> practices with scrubs and expect scrub to deliver better data
>> protection (it won't).
>
> Many people here would profoundly disagree with the above. There is
> no substitute for good backups, but a periodic scrub helps validate
> that a later resilver would succeed. A periodic scrub also helps
> find system problems early, when they are less likely to crater your
> business. It is much better to find an issue during a scrub rather
> than during resilver of a mirror or raidz.

As I said, I am concerned that people would mistakenly expect that
scrubbing offers data protection. It doesn't. I think you proved my
point? ;-)

>> Scrubs are also useful for detecting broken hardware. However,
>> normal activity will also detect broken hardware, so it is better
>> to think of scrubs as finding degradation of old data rather than
>> being a hardware checking service.
>
> Do you have a scientific reference for this notion that "old data"
> is more likely to be corrupt than "new data", or is it just a gut
> feeling? This hypothesis does not sound very supportable to me.
> Magnetic hysteresis lasts quite a lot longer than the recommended
> service life for a hard drive. Studio audio tapes from the '60s are
> still being used to produce modern "remasters" of old audio
> recordings which sound better than they ever did before (other than
> the master tape).

Those are analog tapes... they just fade away... For data, it depends
on the ECC methods, quality of the media, environment, etc. You will
find considerable attention spent on verification of data on tapes in
archiving products. In the tape world, conditions are slightly different
from the magnetic disk world, but I can't think of a single study which
shows that magnetic disks get more reliable over time, while there are
dozens which show that they get less reliable and that latent sector
errors dominate, as much as 5x, over full-disk failures. My studies of
Sun disk failure rates have shown similar results.

> Some forms of magnetic hysteresis are known to last millions of
> years. Media failure is more often than not mechanical or chemical
> and not related to loss of magnetic hysteresis. Head failures may
> be construed to be media failures.

Here is a good study from the University of Wisconsin-Madison which
clearly shows the relationship between disk age and latent sector
errors. It also shows how the increase in areal density increases the
latent sector error (LSE) rate. Additionally, this gets back to the ECC
method, which we observe to be different on consumer-grade and
enterprise-class disks; the study shows a clear win for enterprise-class
drives wrt latent errors. The paper suggests a 2-week scrub cycle and
recognizes that many RAID arrays have such policies. There are indeed
many studies which show latent sector errors are a bigger problem as the
disk ages.

   An Analysis of Latent Sector Errors in Disk Drives
   www.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.ps

> See http://en.wikipedia.org/wiki/Ferromagnetic for information on
> ferromagnetic materials.

For disks we worry about the superparamagnetic effect.
http://en.wikipedia.org/wiki/Superparamagnetism

Quoting US Patent 6987630:

   ... the superparamagnetic effect is a thermal relaxation of
   information stored on the disk surface. Because the superparamagnetic
   effect may occur at room temperature, over time, information stored
   on the disk surface will begin to decay. Once the stored information
   decays beyond a threshold level, it will be unable to be properly
   read by the read head and the information will be lost. The
   superparamagnetic effect manifests itself by a loss in amplitude in
   the readback signal over time or an increase in the mean square error
   (MSE) of the readback signal over time. In other words, the readback
   signal quality metrics are mean square error and amplitude as
   measured by the read channel integrated circuit. Decreases in the
   quality of the readback signal cause bit error rate (BER) increases.
   As is well known, the BER is the ultimate measure of drive
   performance in a disk drive.

This effect is based on the time since written. Hence, older data can
have higher MSE and subsequent BER, leading to a UER. To be fair, newer
disk technology is constantly improving. But what is consistent with the
physics is that an increase in bit density leads to more space and a
rebalancing of the BER. IMHO, this is why we see densities increase but
UER does not increase (hint: marketing always wins these sorts of
battles). FWIW, flash memories are not affected by superparamagnetic
decay.

> It would be most useful if zfs incorporated a slow-scan scrub which
> validates data at a low rate of speed that does not hinder active
> I/O. Of course this is not a "green", energy-efficient solution.

Oprea and Juels write, "Our key insight is that more aggressive
scrubbing does not always increase disk reliability, as previously
believed." They show how read-induced LSEs would tend to encourage you
to scrub less frequently. They also discuss the advantage of random
versus sequential scrubbing. I would classify zfs scrubs as more random
than sequential, for most workloads. Their model is even more
sophisticated and considers scrubbing policy based on the age of the
disk and how many errors have been previously detected.

   A Clean-Slate Look at Disk Scrubbing
   http://www.rsa.com/rsalabs/staff/bios/aoprea/publications/scrubbing.pdf

Finally, there are two basic types of scrubs: read-only and rewrite. ZFS
does read-only. Other scrubbers can do rewrite. There is evidence that
rewrites are better for attacking superparamagnetic decay issues.

So it is still not clear what the best scrubbing model or interval
should be for the general case. I suggest scrubbing periodically, but
not continuously :-) Currently, scrub has the lowest priority in the
vdev_queue, but I think the vdev_queue could use more research.
 -- richard
David Magda
2009-Sep-29 01:46 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Sep 28, 2009, at 19:39, Richard Elling wrote:

> Finally, there are two basic types of scrubs: read-only and rewrite.
> ZFS does read-only. Other scrubbers can do rewrite. There is evidence
> that rewrites are better for attacking superparamagnetic decay issues.

Something that may be possible when bp rewrite is eventually committed.

Educating post. Thanks.
Robert Milkowski
2009-Sep-29 02:08 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
Bob Friesenhahn wrote:

> On Mon, 28 Sep 2009, Richard Elling wrote:
>>
>> Scrub could be faster, but you can try
>>    tar cf - . > /dev/null
>>
>> If you think about it, validating checksums requires reading the data.
>> So you simply need to read the data.
>
> This should work but it does not verify the redundant metadata. For
> example, the duplicate metadata copy might be corrupt but the problem
> is not detected since it did not happen to be used.

Not only that -- it also won't read all the copies of the data if zfs
has redundancy configured at the pool level. Scrubbing the pool will.
And that's the main reason behind the scrub: to be able to detect and
repair checksum errors (if any) while a redundant copy is still fine.

--
Robert Milkowski
http://milek.blogspot.com
Robert Milkowski
2009-Sep-29 02:12 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
Robert Milkowski wrote:

> Bob Friesenhahn wrote:
>> On Mon, 28 Sep 2009, Richard Elling wrote:
>>>
>>> Scrub could be faster, but you can try
>>>    tar cf - . > /dev/null
>>>
>>> If you think about it, validating checksums requires reading the data.
>>> So you simply need to read the data.
>>
>> This should work but it does not verify the redundant metadata. For
>> example, the duplicate metadata copy might be corrupt but the problem
>> is not detected since it did not happen to be used.
>
> Not only that -- it also won't read all the copies of the data if zfs
> has redundancy configured at the pool level. Scrubbing the pool will.
> And that's the main reason behind the scrub: to be able to detect and
> repair checksum errors (if any) while a redundant copy is still fine.

Also, reading with tar means reading from the ARC and/or L2ARC if the
data is cached, which won't verify whether the data is actually fine on
disk. Scrub won't use the cache and will always go to the physical
disks.

--
Robert Milkowski
http://milek.blogspot.com
Bob Friesenhahn
2009-Sep-29 03:43 UTC
[zfs-discuss] Quickest way to find files with cksum errors without doing scrub
On Mon, 28 Sep 2009, Richard Elling wrote:

>> Many people here would profoundly disagree with the above. There is no
>> substitute for good backups, but a periodic scrub helps validate that a
>> later resilver would succeed. A periodic scrub also helps find system
>> problems early, when they are less likely to crater your business. It
>> is much better to find an issue during a scrub rather than during
>> resilver of a mirror or raidz.
>
> As I said, I am concerned that people would mistakenly expect that
> scrubbing offers data protection. It doesn't. I think you proved my
> point? ;-)

It does not specifically offer data "protection", but if you have only
duplex redundancy, it substantially helps find and correct a failure
which would otherwise have caused data loss during a resilver. The value
diminishes substantially if you have triple redundancy.

I hope it does not offend that I scrub my mirrored pools once a week.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/