If ZFS experiences a checksum error on a block that's part of a file being read by a user process, and the block is in an unreplicated pool, it logs the error and signals an I/O error rather than return data to the process. Is there a way for a process to request the corrupted block as-is?

The block might still be of some value, especially since the corruption is likely due to just a single flipped bit, and for some data (e.g. audio and video data) a slightly corrupted block is much better than a missing or zeroed block. Also for textual data, especially text intended only for human consumption, a mostly-correct block with a single flipped bit (and thus normally a single corrupted character) is much better than losing many characters.

At the very least, there needs to be an administrative tool to update the block checksums for a file to match the blocks as they actually exist on disk, so that user processes can access the corrupted file.

This message posted from opensolaris.org
On Tue, Dec 13, 2005 at 11:02:10AM -0800, Andrew wrote:
> If ZFS experiences a checksum error on a block that's part of a file
> being read by a user process, and the block is in an unreplicated
> pool, it logs the error and signals an I/O error rather than return
> data to the process. Is there a way for a process to request the
> corrupted block as-is?

Not currently. See:

6186106 ZFS needs a mechanism to read blocks with bad checksums

If you have any suggestions about what such a mechanism would look like, please let us know.

> The block might still be of some value,
> especially since the corruption is likely due to just a single flipped
> bit.

This is not necessarily true. See this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=3705

> and for some data (e.g. audio and video data) a slightly
> corrupted block is much better than a missing or zeroed block. Also
> for textual data, especially text intended only for human consumption,
> a mostly-correct block with a single flipped bit (and thus normally a
> single corrupted character) is much better than losing many
> characters. At the very least, there needs to be an administrative
> tool to update the block checksums for a file to match the blocks as
> they actually exist on disk, so that user processes can access the
> corrupted file.

This is a pretty dangerous thing to do. First of all, it would be impossible for metadata - simply 'fixing' the checksum would most likely result in panics and arbitrary corruption. It would be possible to do this for data blocks, but as soon as you do, you lose any way to identify these blocks after the fact.

I think we need to understand some real-world cases of corruption as well as specific consumers that would benefit, before we decide that this is a required feature.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
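Since there is no way to ask for the damaged block itself, about the best a user process can do on its own is read around the failure. The following is only a minimal sketch of that workaround, not a ZFS interface: it assumes the checksum failure surfaces to the process as EIO, the function name and the 128 kB default block size are illustrative choices, and `os.pread` is only available on Unix-like systems. Unreadable blocks come back zero-filled, which is exactly the "missing or zeroed block" outcome discussed above.

```python
import errno
import os


def read_skipping_bad_blocks(path, block_size=128 * 1024):
    """Read a file block by block, zero-filling blocks that fail with EIO.

    Returns (data, bad_offsets). The function name and block size are
    hypothetical; this is a generic POSIX-style salvage read.
    """
    chunks = []
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # only I/O errors are treated as bad blocks
                data = b""
                bad_offsets.append(offset)
            # Pad unreadable or short reads so later offsets stay aligned.
            chunks.append(data.ljust(want, b"\0"))
            offset += want
    finally:
        os.close(fd)
    return b"".join(chunks), bad_offsets
```

The returned offset list at least tells the caller which ranges were lost, so an audio or video tool could mark those ranges rather than abort on the first error.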
> > Is there a way for a process to request the
> > corrupted block as-is?
>
> Not currently. See:
>
> 6186106 ZFS needs a mechanism to read blocks with bad checksums

Google search for "ZFS needs a mechanism to read blocks with bad checksums" turns up nothing. Google search for 6186106 ZFS turns up nothing.

> > The block might still be of some value,
> > especially since the corruption is likely due to just a single flipped
> > bit.
>
> This is not necessarily true. See this thread:
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=3705

I stand corrected. However, for a single-block failure, ZFS is still guaranteed to excise the entire glob of corrupted data and almost certainly some good data along with it, and my comment still applies that for some applications (notably audio and video) it's better to get a block of partially-good data with some random garbage mixed in than to get an entirely unreadable block.

> This is a pretty dangerous thing to do. First of all, it would be
> impossible for metadata - simply 'fixing' the checksum would most likely
> result in panics and arbitrary corruption.

True. But even in this case, at least allowing the data to be manually read (though not interpreted by the filesystem) could be helpful, because it could allow manual partial recovery of directory contents.

> It would be possible to do this for data blocks, but as soon as you do,
> you lose any way to identify these blocks after the fact.

Well, ZFS would have already informed the administrator about those blocks; at that point, if he tells ZFS to rechecksum them, it's his responsibility to identify them after the fact. The most practical course of action would probably be to request a full-pool search through all datasets to find all files and volumes which reference the block, flag them in some way as being corrupted and note the offending offset range, and then tell ZFS to rechecksum the block.
Eric Schrock
2005-Dec-14 01:01 UTC
[zfs-discuss] Re: Requesting a corrupted data block as-is
On Tue, Dec 13, 2005 at 04:35:19PM -0800, Andrew wrote:
>
> Google search for "ZFS needs a mechanism to read blocks with bad
> checksums" turns up nothing. Google search for 6186106 ZFS turns up
> nothing.

Sorry, this is a reference to a Solaris bug. See the OpenSolaris bug database for more info.

We understand your concerns, but some of the problems (such as metadata reconstruction) are next to impossible to accomplish in any programmatic way. We can see the benefit of providing a means to access corrupted blocks, hence the RFE. But in the face of compression and arbitrarily corrupted metadata, it's simply not very high on our list.

As always, feel free to play around and prove us wrong. libzpool/zdb provides a safe way to read (but not write) pool data in userland so you can experiment without panicking your system ;-)

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Richard Elling
2005-Dec-14 01:19 UTC
[zfs-discuss] Re: Re: Requesting a corrupted data block as-is
> We understand your concerns, but some of the problems (such as metadata
> reconstruction) are next to impossible to accomplish in any programmatic
> way. We can see the benefit of providing a means to access corrupted
> blocks, hence the RFE. But in the face of compression and arbitrarily
> corrupted metadata, it's simply not very high on our list.

Don't underestimate this barrier. For compressed or (future?) encrypted data, even a single bit flip has consequences far beyond the block in question. If you allow this to be automatically propagated, then life gets real bad, real quick for most folks. It is better to flag the problem and let a human deal with it.

-- richard
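The effect of a single flipped bit on compressed data can be illustrated with a small sketch. It uses Python's zlib purely as a stand-in (ZFS has its own compression code, so this is an assumption for illustration only), and the helper names are hypothetical. One flipped bit in the plain bytes damages one byte; the same flip in the compressed stream is counted as losing the whole block whenever the stream no longer decodes.

```python
import zlib


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 = LSB of byte 0)."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def bytes_lost_uncompressed(plain: bytes, bit: int) -> int:
    """Bytes that differ after flipping one bit of the stored plain data."""
    damaged = flip_bit(plain, bit)
    return sum(a != b for a, b in zip(plain, damaged))


def bytes_lost_compressed(plain: bytes, bit: int) -> int:
    """Bytes lost after flipping one bit of the zlib-compressed form.

    A stream that no longer decodes is counted as losing every byte.
    """
    stored = zlib.compress(plain)
    try:
        recovered = zlib.decompress(flip_bit(stored, bit))
    except zlib.error:
        return len(plain)
    differing = sum(a != b for a, b in zip(plain, recovered))
    return differing + abs(len(plain) - len(recovered))
```

Note that zlib carries its own header check and trailing checksum, so most flips make decompression fail outright; a compressor without such checks would instead tend to return garbage from the damaged point onward.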
Casper.Dik at Sun.COM
2005-Dec-14 09:15 UTC
[zfs-discuss] Re: Requesting a corrupted data block as-is
>We understand your concerns, but some of the problems (such as metadata
>reconstruction) are next to impossible to accomplish in any programmatic
>way. We can see the benefit of providing a means to access corrupted
>blocks, hence the RFE. But in the face of compression and arbitrarily
>corrupted metadata, it's simply not very high on our list.
>
>As always, feel free to play around and prove us wrong. libzpool/zdb
>provides a safe way to read (but not write) pool data in userland so you
>can experiment without panicking your system ;-)

If such corruption happens and you fix it (any which way), would it be possible for ZFS to have leaked some blocks which are now unreachable?

Casper
Eric Schrock
2005-Dec-14 15:29 UTC
[zfs-discuss] Re: Requesting a corrupted data block as-is
On Wed, Dec 14, 2005 at 10:15:09AM +0100, Casper.Dik at sun.com wrote:
>
> If such corruption happens and you fix it (any which way); would it be
> possible for ZFS to have leaked some blocks which are now unreachable?
>

Yes. We have talked about various ways to handle this, but we haven't had any wonderful ideas yet. Right now, we could just leak the blocks forever, or do offline garbage collection of blocks (which is slow and not very ZFS-like because it needs to be offline).

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Casper.Dik at Sun.COM
2005-Dec-14 16:22 UTC
[zfs-discuss] Re: Requesting a corrupted data block as-is
>On Wed, Dec 14, 2005 at 10:15:09AM +0100, Casper.Dik at sun.com wrote:
>>
>> If such corruption happens and you fix it (any which way); would it be
>> possible for ZFS to have leaked some blocks which are now unreachable?
>>
>
>Yes. We have talked about various ways to handle this, but we haven't
>had any wonderful ideas yet. Right now, we could just leak the blocks
>forever, or do offline garbage collection of blocks (which is slow and
>not very ZFS-like because it needs to be offline).

Considering that this only happens on metadata corruption issues, this should be a rare occurrence anyway.

Casper
George Paplas
2005-Dec-19 01:08 UTC
[zfs-discuss] Re: Requesting a corrupted data block as-is
--- Eric Schrock <eric.schrock at sun.com> wrote:
> On Tue, Dec 13, 2005 at 04:35:19PM -0800, Andrew wrote:
> >
> > Google search for "ZFS needs a mechanism to read blocks with bad
> > checksums" turns up nothing. Google search for 6186106 ZFS turns up
> > nothing.
>
> Sorry, this is a reference to a Solaris bug. See the OpenSolaris bug
> database for more info.
>
> We understand your concerns, but some of the problems (such as metadata
> reconstruction) are next to impossible to accomplish in any programmatic
> way. We can see the benefit of providing a means to access corrupted
> blocks, hence the RFE. But in the face of compression and arbitrarily
> corrupted metadata, it's simply not very high on our list.

I guess that if the checksum of a block fails and there is no way to heal it, then you could try brute-force bit flipping until the checksum matches. You would probably need to limit the total number of bit flips in order to be reasonably fast and safe from checksum collisions.

> As always, feel free to play around and prove us wrong. libzpool/zdb
> provides a safe way to read (but not write) pool data in userland so
> you can experiment without panicking your system ;-)
>
> - Eric
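The brute-force idea can be sketched in a few lines. This is only an illustration under stated assumptions: SHA-256 from Python's hashlib stands in for whatever checksum the pool actually uses, and the function name and `max_flips` parameter are hypothetical. It tries every combination of up to `max_flips` flipped bits and returns the first candidate whose digest matches the stored one.

```python
import hashlib
from itertools import combinations


def repair_by_bit_flips(block: bytes, expected_digest: bytes, max_flips: int = 1):
    """Search for a block within max_flips bit flips that matches the digest.

    Returns the repaired bytes, or None if no candidate matches.
    SHA-256 is an assumed stand-in for the real block checksum.
    """
    if hashlib.sha256(block).digest() == expected_digest:
        return block  # nothing to repair
    nbits = len(block) * 8
    for flips in range(1, max_flips + 1):
        for positions in combinations(range(nbits), flips):
            candidate = bytearray(block)
            for bit in positions:
                candidate[bit // 8] ^= 1 << (bit % 8)
            if hashlib.sha256(candidate).digest() == expected_digest:
                return bytes(candidate)
    return None
```

Capping `max_flips` keeps the search bounded; every extra allowed flip multiplies the number of candidates by roughly the number of bits in the block.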
Andrew
2005-Dec-19 16:20 UTC
[zfs-discuss] Re: Re: Requesting a corrupted data block as-is
> I guess that if the checksum of a block fails and there is no way to
> heal it,
> then you could try doing a brute-force bit flipping till the checksum
> matches.

In other words, invert the hash using the corrupted block as a hint.

> Probably you will need to limit the total number of bit-flips in order
> to be
> reasonably fast and safe of checksum collisions.

Speed, rather than collisions, will be the limiting factor for the foreseeable future. Brute-forcing a 128 kB block (the maximum ZFS size) would require about a million full-block hashes to test one flipped bit, and roughly half a trillion for two; the latter would take months on a contemporary system. Even for a 512-byte block (the minimum ZFS size), testing four flipped bits would require on the order of ten trillion full-block hashes, which would take months.
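The orders of magnitude are easy to check. The sketch below counts each unordered set of bit positions once, which is an assumption about how the search would enumerate candidates; the function name is hypothetical, and `math.comb` needs Python 3.8 or later.

```python
from math import comb


def candidate_blocks(block_bytes: int, flips: int) -> int:
    """Number of distinct blocks reachable by flipping exactly `flips` bits."""
    return comb(block_bytes * 8, flips)


if __name__ == "__main__":
    for size, flips in [(128 * 1024, 1), (128 * 1024, 2), (512, 4)]:
        count = candidate_blocks(size, flips)
        print(f"{size}-byte block, {flips} flipped bit(s): {count:,} hashes")
```

Under that counting, a 128 kB block gives 1,048,576 candidates for one flipped bit and 549,755,289,600 for two, and a 512-byte block gives 11,710,951,848,960 candidates for four.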