After replacing a bad disk and waiting for the resilver to complete, I started a scrub of the pool. Currently, I have the pool mounted read-only, yet almost a quarter of the I/O is writes to the new disk. In fact, it looks like there are so many checksum errors that zpool doesn't even list them properly:

  pool: p
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 18.71% done, 2h17m to go
config:

        NAME        STATE     READ WRITE CKSUM
        p           ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c5d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0 231.5

errors: No known data errors

I assume that that should be followed by a K. Is my brand new replacement disk really returning gigabyte after gigabyte of silently corrupted data? I find that quite amazing, and I thought that I would inquire here. This is on snv_60.

Chris
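For what it's worth, the "231.5" reading makes sense as a truncated "231.5K": zpool status abbreviates large counters with K/M/G suffixes. Here is a minimal illustrative sketch of that kind of formatting (not the actual ZFS code, which may use different thresholds and rounding):

```python
def abbreviate(n):
    """Format a counter roughly the way zpool status abbreviates
    large values (illustrative sketch only, not the real nicenum())."""
    for suffix in ("", "K", "M", "G", "T"):
        if n < 1000:
            # Once a suffix is in use, show one decimal place.
            if suffix and n != int(n):
                return f"{n:.1f}{suffix}"
            return f"{int(n)}{suffix}"
        n /= 1000.0
    return f"{n:.1f}P"

print(abbreviate(231500))  # → 231.5K
print(abbreviate(231))     # → 231
```

So a value of "231.5" in the CKSUM column with the suffix dropped would correspond to roughly 231,500 checksum errors, which matches the "gigabyte after gigabyte" scale of corruption.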
I have some further data now, and I don't think that it is a hardware problem. Halfway through the scrub, I rebooted and exchanged the controller and cable used with the "bad" disk. After restarting the scrub, it proceeded error-free until about the point where it left off, and then it resumed the exact same behavior. Basically, almost exactly one fourth of the amount of data that is read from the resilvered disk is written back to the same disk, and this ratio was constant throughout the scrub. Meanwhile, fmd writes ereport.fs.zfs.io events to errlog until the disk is full. At this point, it seems as if the resilvering code in snv_60 is broken, and one fourth of the data was not reconstructed properly. I have an iosnoop trace of the disk in question, if anyone is interested. I will try to make some sense of it, but that probably won't happen today.

Chris
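In case anyone wants to reproduce the one-fourth measurement from a trace like mine, here is a small sketch that sums read and write bytes per iosnoop line and reports the write/read ratio. It assumes iosnoop's default column layout (UID PID D BLOCK SIZE COMM PATHNAME, with D being R or W and SIZE in bytes); adjust the field indices if your trace options differ:

```python
def rw_ratio(lines):
    """Sum bytes read and written from iosnoop output lines and
    return the writes/reads ratio.  Assumes the default columns:
    UID PID D BLOCK SIZE COMM PATHNAME (D is R or W, SIZE in bytes)."""
    totals = {"R": 0, "W": 0}
    for line in lines:
        fields = line.split()
        # Skip the header line and anything malformed.
        if len(fields) < 5 or fields[2] not in totals:
            continue
        totals[fields[2]] += int(fields[4])
    return totals["W"] / totals["R"]

# Hypothetical trace excerpt, for illustration only.
sample = [
    "UID PID D BLOCK SIZE COMM PATHNAME",
    "0 3 R 1024 8192 sched <none>",
    "0 3 R 2048 8192 sched <none>",
    "0 3 R 3072 8192 sched <none>",
    "0 3 R 4096 8192 sched <none>",
    "0 3 W 1024 8192 sched <none>",
]
print(rw_ratio(sample))  # → 0.25
```

A ratio hovering near 0.25 over the whole trace would confirm that one fourth of what the scrub reads from the resilvered disk fails verification and gets rewritten.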