Joe Peterson
2008-Feb-09 16:52 UTC
[zfs-discuss] Checksum error in single device ZFS pool & mysterious resilvering after reboot
I have recently been experimenting with ZFS (under FreeBSD 7.0), and I am excited about the possibilities. I have seen, however, two puzzling issues, and I am hoping someone here might have some insight or info.

Note that the original zpool I am discussing is gone now; I had to recreate the pool to make space for a larger root partition, etc. That pool had only a single device/partition (i.e. no mirroring or raidz).

Anyway, the two issues are:

1) During a routine test scrub, a checksum error was found. Some other, non-fatal errors were also reported around this time (note that I am investigating potential FreeBSD DMA messages in my log, which could be related; the hardware seems to check out fine, BTW). The checksum error was persistent because the pool had no mirror to repair from. One file was affected, and upon further investigation, a 64K block (1/2 of a ZFS block?) appeared to contain data from some other file.

2) After my first reboot from the above situation, zpool status reported "resilvering completed...". All error counts were also zeroed (not unusual), but the pool now reported that no known errors existed. Rescrubbing revealed the checksum error again, but this time no subsequent reboot ever caused resilvering or hid the file error again.

The puzzling thing is why ZFS would resilver anything, given that there was only one device in the pool. Plus, the fact that the pool now thought it was "OK" was disturbing.

I am, of course, unsure whether the 64K of the file was corrupted when the file was originally written, some time after that, or whether the strange "resilver" caused the block to become this way. The bad data appeared to be text belonging to another file, specifically a Mozilla mork DB file; it could have come from that file, or been on its way to it and mistakenly written into this file (although no other files had checksum issues).
The history of that file is that I had copied my entire home dir to the ZFS pool with no errors reported, and I had not touched that file before the scrub found the problem. I did some forensics later by temporarily disabling the checksum error in the ZFS code so that I could read the whole file (this is how I revealed the contents of the bad block). The bad 64K was only a small part of the file, starting at byte offset 655360 and running for exactly 64K. The rest of the file, up to 3MB or so, was fine.

Anyway, does any of the above ring any bells?

Thanks, Joe
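
P.S. A quick sanity check on the offset arithmetic, in case it helps anyone reason about the "1/2 of a ZFS block?" question. This is only a sketch: it assumes the dataset used the default recordsize of 128K, which I did not verify on the original pool, and the function name locate is just something made up for illustration.

```python
# Assumption: the dataset used the ZFS default recordsize of 128K.
RECORDSIZE = 128 * 1024
BAD_OFFSET = 655360      # byte offset of the bad region, as reported above
BAD_LENGTH = 64 * 1024   # length of the bad region, as reported above


def locate(offset, length, recordsize=RECORDSIZE):
    """Return (first_record, last_record, offset_within_first_record).

    Records are numbered from 0. 'locate' is a hypothetical helper,
    not part of any ZFS tool.
    """
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return first, last, offset % recordsize


first, last, within = locate(BAD_OFFSET, BAD_LENGTH)
print(f"records {first}..{last}, starting {within} bytes into record {first}")
# prints "records 5..5, starting 0 bytes into record 5"
```

Under that assumption, the bad region is exactly the first half of the sixth 128K record of the file (record 5, counting from 0), and it does not straddle a record boundary. If the recordsize was something other than 128K, the numbers above would change accordingly.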