Gary Mills
2009-Jan-14 22:39 UTC
[zfs-discuss] What are the usual suspects in data errors?
I realize that any error can occur in a storage subsystem, but most of these have an extremely low probability. I'm interested in discussing only those that do occur occasionally, and that are not catastrophic.

Consider the common configuration of two SCSI disks connected to the same HBA that are configured as a mirror in some manner. In this case, the data path in general consists of:

 o The application
 o The filesystem
 o The drivers
 o The HBA
 o The SCSI bus
 o The controllers
 o The heads and platters

Many of those components have their own error checking. Some have error correction. For example, parity checking is done on a SCSI bus, unless it's specifically disabled. Do SATA and PATA connections also do error checking? Disk sector I/O uses CRC error checking and correction. Memory buffers would often be protected by parity memory. Is there any more that I've missed?

Now, let's consider common errors. To me, the most frequent would be a bit error on a disk sector. In this case, the controller would report a CRC error and would not return bad data. The filesystem would obtain the data from its redundant copy. I assume that ZFS would also rewrite the bad sector to correct it. The application would not see an error. Similar events would happen for a parity error on the SCSI bus.

What can go wrong with the disk controller? A simple seek to the wrong track is not a problem because the track number is encoded on the platter. The controller will simply recalibrate the mechanism and retry the seek. If it computes the wrong sector, that would be a problem. Does this happen with any frequency? In this case, ZFS would detect a checksum error and obtain the data from its redundant copy.

A logic error in ZFS might result in incorrect metadata being written with a valid checksum. In this case, ZFS might panic on import or might correct the error. How is this sort of error prevented?

If the application wrote bad data to the filesystem, none of the error checking in the lower layers would detect it. This would be strictly an error in the application.

Some errors might result from a loss of power if some ZFS data was written to a disk cache but never was written to the disk platter. Again, ZFS might panic on import or might correct the error. How is this sort of error prevented?

After all of this discussion, what other errors can ZFS checksums reasonably detect? Certainly if some of the other error checking failed to detect an error, ZFS would still detect one. How likely are these other error checks to fail?

Is there anything else I've missed in this analysis?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
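For readers who want to see the checksum-detection step above concretely, here is a minimal standalone sketch. It is not the actual ZFS code: the block size, the test pattern, and the simplified Fletcher-style running sums are all made up for the demo. The point is only that a checksum stored apart from the data (in ZFS, in the parent block pointer) reveals a single silently flipped bit that the block itself cannot.

    /*
     * Sketch only: a simplified Fletcher-style checksum over a data block,
     * showing how a separately stored checksum catches a single-bit flip.
     * Block size, seed pattern, and the checksum itself are illustrative.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct cksum { uint64_t a, b, c, d; };

    /* Fletcher-like running sums over 32-bit words; no overflow folding here. */
    static void block_checksum(const uint32_t *words, size_t nwords, struct cksum *ck)
    {
        uint64_t a = 0, b = 0, c = 0, d = 0;
        for (size_t i = 0; i < nwords; i++) {
            a += words[i];
            b += a;
            c += b;
            d += c;
        }
        ck->a = a; ck->b = b; ck->c = c; ck->d = d;
    }

    int main(void)
    {
        uint32_t block[128];                         /* pretend 512-byte "sector"   */
        for (size_t i = 0; i < 128; i++)
            block[i] = (uint32_t)(i * 2654435761u);  /* arbitrary test pattern      */

        struct cksum expected, actual;
        block_checksum(block, 128, &expected);       /* kept apart from the data    */

        block[37] ^= 0x00000400;                     /* silent single-bit flip "on disk" */

        block_checksum(block, 128, &actual);         /* recomputed on read          */
        if (memcmp(&expected, &actual, sizeof expected) != 0)
            printf("checksum mismatch: read from the redundant copy instead\n");
        else
            printf("block verified\n");
        return 0;
    }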
A Darren Dunham
2009-Jan-14 23:15 UTC
[zfs-discuss] What are the usual suspects in data errors?
On Wed, Jan 14, 2009 at 04:39:03PM -0600, Gary Mills wrote:
> I realize that any error can occur in a storage subsystem, but most
> of these have an extremely low probability. I'm interested in this
> discussion in only those that do occur occasionally, and that are
> not catastrophic.

What level is "extremely low" here?

> Many of those components have their own error checking. Some have
> error correction. For example, parity checking is done on a SCSI bus,
> unless it's specifically disabled. Do SATA and PATA connections also
> do error checking? Disk sector I/O uses CRC error checking and
> correction. Memory buffers would often be protected by parity memory.
> Is there any more that I've missed?

Reports suggest that bugs in drive firmware could account for errors at a level that is not insignificant.

> What can go wrong with the disk controller? A simple seek to the
> wrong track is not a problem because the track number is encoded on
> the platter. The controller will simply recalibrate the mechanism and
> retry the seek. If it computes the wrong sector, that would be a
> problem. Does this happen with any frequency?

Netapp documents certain rewrite bugs that they've specifically seen. I would imagine they have good data on the frequency that they see it in the field.

> In this case, ZFS would detect a checksum error and obtain the data
> from its redundant copy.

Correct.

> A logic error in ZFS might result in incorrect metadata being written
> with valid checksum. In this case, ZFS might panic on import or might
> correct the error. How is this sort of error prevented?

It's very difficult to protect yourself from software bugs with the same piece of software. You can create assertions that are hopefully simpler and less prone to errors, but they will not catch all bugs.

> Some errors might result from a loss of power if some ZFS data was
> written to a disk cache but never was written to the disk platter.
> Again, ZFS might panic on import or might correct the error. How is
> this sort of error prevented?

ZFS uses a multi-stage commit. It relies on the "disk" responding to a request to flush caches to the disk. If that assumption is correct, then there is no problem in general with power issues. The disk is consistent both before and after the cache is flushed.

-- 
Darren
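To make the multi-stage-commit idea concrete, here is a user-level analogue of the same ordering discipline: make the new data durable first, then publish the pointer to it, with an explicit flush between the two stages. This is a sketch only, not ZFS internals; it assumes a POSIX system, and the file names are invented for the example. The fsync() calls play the role of the cache-flush request described above.

    /*
     * User-level analogue of ordered commit: write the new copy, flush it,
     * then atomically switch the name that points at it, then flush the
     * directory. At every instant the "pointer" names either the old or the
     * new state, never a partially written one. File names are made up.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *tmp = "state.tmp", *final = "state.dat";
        const char *payload = "new application state\n";

        /* Stage 1: write the new copy and force it to stable storage. */
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, payload, strlen(payload)) < 0) { perror("write"); return 1; }
        if (fsync(fd) < 0) { perror("fsync"); return 1; }  /* analogous to a cache flush */
        close(fd);

        /* Stage 2: publish the pointer; readers see old state or new, never a mix. */
        if (rename(tmp, final) < 0) { perror("rename"); return 1; }

        /* Flush the directory so the rename itself survives power loss. */
        int dfd = open(".", O_RDONLY);
        if (dfd >= 0) { fsync(dfd); close(dfd); }

        puts("committed");
        return 0;
    }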
darn, Darren, learning fast!

best,
z

----- Original Message ----- 
From: "A Darren Dunham" <ddunham at taos.com>
To: <zfs-discuss at opensolaris.org>
Sent: Wednesday, January 14, 2009 6:15 PM
Subject: Re: [zfs-discuss] What are the usual suspects in data errors?

[snip]
folks, please, chatting on - don't make me stop you, we are all open folks.

[but darn]

ok, thank you much for the anticipation for something actually useful, here is another thing I shared with MS Storage but not with you folks yet --

we win with real advantages, not lies, not scales, but only real knowhow.

cheers,
z

----- Original Message ----- 
From: "JZ" <jz at excelsioritsolutions.com>
To: "A Darren Dunham" <ddunham at taos.com>; <zfs-discuss at opensolaris.org>
Sent: Wednesday, January 14, 2009 7:38 PM
Subject: Re: [zfs-discuss] What are the usual suspects in data errors?

[snip]
Folks, I am very sorry, for I don't know how to be not misleading.

I was not challenging the Ten Commandments. That is a good code. And maybe the first one we need to follow.

Vain and pride and arrogance and courage are all very different, but very similar. Before you can truly understand the code of love, you will have to be very careful.

And then, there are beyond.

Folks, this is a technology discussion, not a religious discussion. I love you all. I do not want to see you folks unable to make it to your dream states with your technology knowhow just because you don't even understand the basic code of love.

Folks, I love you all. OMG, I did not teach King and High anything beyond what I have said here in the open. Is that not enough to make the darn open discussion go on? Please.

[do you know if not because of the 400000 friends, I could be dead by now talking this much to an open list???]

best,
z

----- Original Message ----- 
From: "JZ" <jz at excelsioritsolutions.com>
To: "A Darren Dunham" <ddunham at taos.com>; <zfs-discuss at opensolaris.org>
Sent: Wednesday, January 14, 2009 7:48 PM
Subject: Re: [zfs-discuss] What are the usual suspects in data errors?

[snip]
ok, you open folks are really ????. just one more, and I hope someone replies so we can save some open time.

the code of ?. and what is the relationship between ? and love?

here, some public info - again, I am only saying this piece in Chinese is pretty readable in my taste, not too much to attack, but hey, whoever wrote this, don't be a hot head. [to that blog writer: "darn", if you also know what that means]
http://blog.ce.cn/html/04/101804-15445.html

and, can someone on the list please provide a translated url to save some open time? every darn hour multiplied by the number of readers here, the help better comes fast, before I darn all of you open folks.

[for the beloved ones, attached an even better code. There has been a tough nut, let me see if I can crack that nut with this public code. :-) ]

z, at home, wondering why Daisy baby is not calling... not interested in the list discussion anymore

----- Original Message ----- 
From: "JZ" <jz at excelsioritsolutions.com>
To: "A Darren Dunham" <ddunham at taos.com>; <zfs-discuss at opensolaris.org>
Sent: Wednesday, January 14, 2009 8:38 PM
Subject: Re: [zfs-discuss] What are the usual suspects in data errors?

[snip]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Do Not Lie.wma
Type: audio/x-ms-wma
Size: 3553103 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090114/945d5568/attachment.bin>
Richard Elling
2009-Jan-15 04:00 UTC
[zfs-discuss] What are the usual suspects in data errors?
well, since this is part of how I make my living, or at least what is in my current job description...

Gary Mills wrote:
> I realize that any error can occur in a storage subsystem, but most
> of these have an extremely low probability. I'm interested in this
> discussion in only those that do occur occasionally, and that are
> not catastrophic.

excellent... fertile ground for research. One of the things that we see occur with ZFS is that it detects errors which were previously not detected. You can see this happen on this forum when people try to kill the canary (ZFS). I think a better analogy is astronomy: as our ability to see more of the universe gets better, we see more of the universe -- but that also raises the number of questions we can't answer... well... yet...

> Consider the common configuration of two SCSI disks connected to
> the same HBA that are configured as a mirror in some manner. In this
> case, the data path in general consists of:

Beware of the Decomposition Law, which says the part is more than a fraction of the whole. This is what trips people up when they think that if every part performs flawlessly, then the whole will perform flawlessly.

> o The application
> o The filesystem
> o The drivers
> o The HBA
> o The SCSI bus
> o The controllers
> o The heads and platters
>
> Many of those components have their own error checking. Some have
> error correction. For example, parity checking is done on a SCSI bus,
> unless it's specifically disabled. Do SATA and PATA connections also
> do error checking? Disk sector I/O uses CRC error checking and
> correction. Memory buffers would often be protected by parity memory.
> Is there any more that I've missed?

thousands more ;-)

> Now, let's consider common errors. To me, the most frequent would
> be a bit error on a disk sector. In this case, the controller would
> report a CRC error and would not return bad data. The filesystem
> would obtain the data from its redundant copy. I assume that ZFS
> would also rewrite the bad sector to correct it. The application
> would not see an error. Similar events would happen for a parity
> error on the SCSI bus.

Nit: modern disks can detect and correct multiple byte errors in a sector. If ZFS can correct it (depends on the ZFS configuration) then it will, but it will not rewrite the defective sector -- it will write to a different sector. While that seems better, it also introduces at least one new failure mode and can help to expose other, existing failure modes, such as phantom writes.

> What can go wrong with the disk controller? A simple seek to the
> wrong track is not a problem because the track number is encoded on
> the platter. The controller will simply recalibrate the mechanism and
> retry the seek. If it computes the wrong sector, that would be a
> problem. Does this happen with any frequency? In this case, ZFS
> would detect a checksum error and obtain the data from its redundant
> copy.
>
> A logic error in ZFS might result in incorrect metadata being written
> with valid checksum. In this case, ZFS might panic on import or might
> correct the error. How is this sort of error prevented?
>
> If the application wrote bad data to the filesystem, none of the
> error checking in lower layers would detect it. This would be
> strictly an error in the application.
>
> Some errors might result from a loss of power if some ZFS data was
> written to a disk cache but never was written to the disk platter.
> Again, ZFS might panic on import or might correct the error. How is
> this sort of error prevented?
>
> After all of this discussion, what other errors can ZFS checksums
> reasonably detect? Certainly if some of the other error checking
> failed to detect an error, ZFS would still detect one. How likely
> are these other error checks to fail?
>
> Is there anything else I've missed in this analysis?

Everything along the way. If you search the archives here you will find anecdotes of:
 + bad disks -- of all sorts
 + bad power supplies
 + bad FC switch firmware
 + flaky cables
 + bugs in NIC drivers
 + transient and permanent DRAM errors
 + and, of course, bugs in ZFS code

Basically, anywhere your data touches can fail. However, to make the problem tractable, we often divide failures into two classifications:
 1. mechanical, including quantum-mechanical
 2. design or implementation, including software defects, design deficiencies, and manufacturing

There is a lot of experience with measurements of mechanical failure modes, so we tend to have some ways to assign reliability budgets and predictions. For #2, the science we use for #1 doesn't apply.
 -- richard
Miles Nordin
2009-Jan-15 17:33 UTC
[zfs-discuss] What are the usual suspects in data errors?
>>>>> "gm" == Gary Mills <mills at cc.umanitoba.ca> writes:

    gm> Is there any more that I've missed?

1. Filesystem/RAID layer dispatches writes 'aaaaaaaaa' to iSCSI
   initiator.  iSCSI initiator accepts them, buffers them, returns
   success to RAID layer.

2. iSCSI initiator sends to iSCSI target.  iSCSI target writes
   'aaaaaaaa'.

3. Network connectivity is interrupted, target is rebooted, something
   like that.

4. Filesystem/RAID layer dispatches writes 'bbbbbbbb' to iSCSI
   initiator.  Initiator accepts, buffers, returns success.

5. iSCSI initiator can't write 'bbbbbbbb'.

6. iSCSI initiator goes through some cargo-cult error-recovery scheme:
   retry this 3 times, timeout, disconnect, reconnect, retry
   really-hard 5 times, timeout, return various errors to RAID layer,
   maybe.

7. OH! Target's back! good.

8. Filesystem/RAID layer writes 'ccccccccc' to iSCSI initiator.  Maybe
   gets an error, maybe flags 'ccccccccc' destination blocks bad,
   increments RAID-layer counters, tries to ``rewrite'' the
   'cccccccc', eventually gets success back from the initiator.

9. Filesystem/RAID layer issues SYNCHRONIZE CACHE to the iSCSI
   initiator.

10. Initiator flushes 'cccccccc' to the target, and waits for target
    to confirm 'ccccccc' and all previous writes are on physical media.

11. Initiator returns success for the SYNCHRONIZE CACHE command.

12. Filesystem/RAID layer writes 'd' commit sector updating pointers,
    aiming various important things at 'bbbbbbbbb'.

Now, the RAID layer thinks 'aaaaaaaaa' and 'bbbbbbbbb' and 'ccccccccc' and 'd' are all written, but in fact only 'aaaaaaaaa' and 'cccccccccc' and 'd' are written, and 'd' points at garbage.

NFS has a state machine designed to handle server reboots without breaking any consistency promises. Substitute ``the userland app'' for Filesystem/RAID, and ``NFSv3 client'' for iSCSI initiator. The NFSv3 client will keep track of which writes are actually committed to disk and batch them into commit blocks of which the userland app is entirely unaware. The NFS client won't free a commit block from its RAM write cache until it's on disk. If the server reboots, it will replay the open commit blocks. If the server AND client reboot, the commit block will be lost from RAM, but then 'd' is not written, so the datastore is not corrupt.

The iSCSI initiator probably needs to do something similar to NFSv3 to enforce that success from SYNCHRONIZE CACHE really means what ZFS thinks it means. It's a little trickier to do this with ZFS/iSCSI because the NFS cop-out was to use 'hard' mounts---you _never_ propagate write failures up the stack. You just freeze the application until you can finally complete the write, and if you can't write, you evade the consistency guarantees by killing the app. Then it's a solvable problem to design apps that won't corrupt their datastores when they're killed, so the overall system works. This world order won't work analogously for ZFS-on-iSCSI, which needs to see failures to handle redundancy. We may even need some new kind of failure code to solve the problem, but maybe something clever can be crammed into the old API.

Imagine the stream of writes to a disk as a bucket-brigade separated by SYNCHRONIZE CACHE commands. The writes within each bucket can be sloshed around (reordered) arbitrarily. And if the machine crashes, we might pour _part_ of the water in the last bucket on the fire, but then we stop and drop all the other buckets. So far, we can handle it.

But we've no way to handle the situation where someone in the _middle_ of the brigade spills the water in his bucket. There's no way to cleanly restart the brigade after this happens. ZFS needs to gracefully handle a SYNCHRONIZE CACHE command that returns _failure_, and needs to interpret such a failure really aggressively, as in:

    Any writes you issued since the last SYNCHRONIZE CACHE, *even if
    you got a Success return to your block-layer write() command*, may
    or may not be committed to disk, and waiting will NOT change the
    situation---they're just gone.  But, the disk is still here, and is
    working, meh, ~fine.  This failure is not ``retryable''.  If you
    issue a second SYNCHRONIZE CACHE command, and it Succeeds, that
    does NOT change what I've just told you.  That Success only refers
    to writes issued between this failing SYNCHRONIZE CACHE command and
    the next one.

Once the iSCSI initiator is fixed, probably we need to go back and add NFS-style commit batches to even SATA disk drivers, which can suffer the same problem if you hot-swap them, or maybe even if you don't hot-swap but the disk reports some error which invokes some convoluted sd/ssd exception handling involving ``resets''. The assumption doesn't hold that write, write, write, synchronize cache promises all those writes are on-disk once synchronize cache returns. The only way to make it hold is to promise to panic the kernel whenever any disk, controller, bus, or iscsi session is ``reset''---the simple, obvious ``SYNCHRONIZE CACHE is the final word of God'' assumption ought to handle cord-yanking just fine, but not smaller failures.
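Here is a toy model of that bucket-brigade failure, reduced to a few arrays. It is my own sketch, not iSCSI or ZFS code; the device functions and block layout are invented for illustration. One buffered write is silently lost when the "target" reboots, a later SYNCHRONIZE CACHE succeeds for the writes issued after the reboot, and the commit block still lands on media pointing at data that never did.

    /*
     * Toy model of the scenario above: a volatile write cache that is lost
     * mid-stream after earlier writes already returned success, followed by
     * a cache flush that genuinely succeeds -- but only for the later writes.
     */
    #include <stdio.h>
    #include <string.h>

    #define NBLK 8

    static char media[NBLK];   /* what is actually on the platter        */
    static char cache[NBLK];   /* volatile write cache on the "target"   */
    static int  dirty[NBLK];   /* cache entries not yet on media         */

    /* Accepts the write and returns immediately -- i.e. "success". */
    static void dev_write(int blk, char v) { cache[blk] = v; dirty[blk] = 1; }

    /* SYNCHRONIZE CACHE: push everything currently dirty onto media. */
    static void dev_sync(void)
    {
        for (int i = 0; i < NBLK; i++)
            if (dirty[i]) { media[i] = cache[i]; dirty[i] = 0; }
    }

    /* Target reboot: buffered-but-unflushed writes are silently lost. */
    static void dev_crash(void) { memset(dirty, 0, sizeof dirty); }

    int main(void)
    {
        memset(media, '.', sizeof media);

        dev_write(0, 'a');   /* data block  */
        dev_sync();          /* 'a' reaches media                         */
        dev_write(1, 'b');   /* data block, still only in the cache       */
        dev_crash();         /* target reboots: 'b' is gone               */
        dev_write(2, 'c');   /* more data after reconnect                 */
        dev_sync();          /* succeeds -- but only covers 'c'           */
        dev_write(3, 'd');   /* commit block pointing at a, b, c          */
        dev_sync();

        /* Prints "a.cd...." -- commit 'd' is durable, the 'b' it references is not. */
        printf("media: %.8s\n", media);
        return 0;
    }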
Kees Nuyt
2009-Jan-15 20:15 UTC
[zfs-discuss] a Min Wang person emailed me for free knowledge
On Wed, 14 Jan 2009 22:40:19 -0500, "JZ" <jz at excelsioritsolutions.com> wrote:

>ok, you open folks are really ????.
>just one more, and I hope someone replies so we can save some open time.

[snip]

JZ, would you please be so kind to refrain from including any attachments in your postings to our beloved zfs-discuss at opensolaris.org, especially large, binary ones?

They are not welcome here, and I'm pretty sure I'm not the only one with that opinion.

Thanks in advance for your cooperation.

Regards,
-- 
  (  Kees Nuyt
  )
c[_]
[last 5 minutes on my lunch, just to say thank you and sorry]

Yes, I was wondering how the first one even made it to the list. None of those emails with large attachments should have been approved by the mail server policy. And I feel bad that I tested the server with some bad text and those got through.

best,
z

----- Original Message ----- 
From: "Kees Nuyt" <k.nuyt at zonnet.nl>
To: <zfs-discuss at opensolaris.org>
Sent: Thursday, January 15, 2009 3:15 PM
Subject: Re: [zfs-discuss] a Min Wang person emailed me for free knowledge

[snip]
Sorry folks, the mail server is just too advanced for me. This email had been sitting in the server since 1/14/2009, and coming through now would be very misleading.

A side-method for Zhou is that (not for anyone to learn) when trouble comes and we are not ready to deal with the trouble, we just hide behind ladies, since that wall provides better protection than any private network (due to the noise of my wall).

This Daisy baby has no personal relationship with me. She is a professional. And Mohegan Sun is not my friend. [if you would like to play in that area, I say Foxwood, because Foxwood is my friend. Even though Daisy is my friend, I will not say Mohegan Sun is my friend because Mohegan Sun is not.]

I used Daisy's name for cover, and that was all. [may not be a correct thing to do in many views, but what is done is done; if I should be punished, I will take that punishment from the sky with happiness, not a big deal.]

So please, focus on the ? part, not the Daisy baby part, in my post on that day.

And again, I had limited respect for the Solaris mail server and tested the server policy with some ridiculous methods. If I have offended anyone in that process, I am sorry.

Best,
z

----- Original Message ----- 
From: "JZ" <jz at excelsioritsolutions.com>
To: "A Darren Dunham" <ddunham at taos.com>; <zfs-discuss at opensolaris.org>
Cc: <schen at mohegansun.com>; "Marvin Wang, Min" <mail2wm at yahoo.com>
Sent: Wednesday, January 14, 2009 10:40 PM
Subject: Re: [zfs-discuss] a Min Wang person emailed me for free knowledge

[snip]