On a Linux forum, I've spoken about ZFS end-to-end data integrity. I wrote things like "upon writing data to disk, ZFS reads it back, compares it to the data in RAM and corrects it otherwise". I also wrote that ordinary HW RAID doesn't do this check. After a heated discussion, I now start to wonder whether this is correct. Am I wrong? So, do ordinary HW RAID controllers check data correctness? The Linux guys want to know this. For instance, do Adaptec's HW RAID controllers not do such a check? Does anyone know more about this? Several Linux guys want to try out ZFS now. :o)
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2008-Dec-28 17:43 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
On Sun, 28 Dec 2008, Orvar Korvar wrote:

> On a Linux forum, I've spoken about ZFS end-to-end data integrity. I wrote things like "upon writing data to disk, ZFS reads it back, compares it to the data in RAM and corrects it otherwise". I also wrote that ordinary HW RAID doesn't do this check. After a heated discussion, I now start to wonder whether this is correct. Am I wrong?

You are somewhat wrong. When ZFS writes the data, it also stores a checksum for the data. When the data is read, it is checksummed again and the result is verified against the stored checksum. It is not possible to compare with data in RAM, since the RAM is usually too small to cache the entire disk, and it would not survive reboots.

> So, do ordinary HW RAID controllers check data correctness? The Linux guys want to know this. For instance, do Adaptec's HW RAID controllers not do such a check? Does anyone know more about this?

My understanding is that ordinary HW RAID does not check data correctness. If the hardware reports a failure to read a block, then a simple algorithm is used to (hopefully) re-create the lost data from the other disks. The difference here is that ZFS checks data correctness (at the CPU) for each read, while HW RAID depends on the hardware detecting a problem; even if the data is ok when read from disk, it may be corrupted by the time it makes it to the CPU.

ZFS's scrub algorithm forces all of the written data to be read and validated against the stored checksums. If a problem is found, an attempt to correct it is made from redundant storage using traditional RAID methods.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
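The mechanism Bob describes can be sketched in a few lines of Python. This is illustrative only: the two-dictionary "mirror", the use of SHA-256 and the healing policy are assumptions made for the example, not ZFS internals.

    import hashlib

    class MiniPool:
        """Toy model of checksummed, mirrored block storage (not ZFS code)."""

        def __init__(self):
            self.copies = [{}, {}]     # two "disks", each mapping block id -> bytes
            self.checksums = {}        # block id -> checksum, kept apart from the data

        def write(self, block_id, data):
            # The checksum is computed before the data leaves the writer's
            # fault domain, then stored separately from the data itself.
            self.checksums[block_id] = hashlib.sha256(data).digest()
            for disk in self.copies:
                disk[block_id] = data

        def read(self, block_id):
            expected = self.checksums[block_id]
            for disk in self.copies:
                data = disk[block_id]
                if hashlib.sha256(data).digest() == expected:
                    # Self-heal: rewrite any copy that does not verify.
                    for other in self.copies:
                        if hashlib.sha256(other[block_id]).digest() != expected:
                            other[block_id] = data
                    return data
            raise IOError("all copies of %r fail checksum" % block_id)

    pool = MiniPool()
    pool.write("blk0", b"important data")
    pool.copies[0]["blk0"] = b"importANT data"        # silent corruption on one "disk"
    assert pool.read("blk0") == b"important data"     # detected, served from the good copy
    assert pool.copies[0]["blk0"] == b"important data"  # and the bad copy was healed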
Carsten Aulbert
2008-Dec-28 17:50 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi all,

Bob Friesenhahn wrote:

> My understanding is that ordinary HW RAID does not check data correctness. If the hardware reports a failure to read a block, then a simple algorithm is used to (hopefully) re-create the lost data from the other disks. The difference here is that ZFS checks data correctness (at the CPU) for each read, while HW RAID depends on the hardware detecting a problem; even if the data is ok when read from disk, it may be corrupted by the time it makes it to the CPU.

AFAIK this is not done during normal operation (unless a disk that is asked for a sector cannot return it).

> ZFS's scrub algorithm forces all of the written data to be read and validated against the stored checksums. If a problem is found, an attempt to correct it is made from redundant storage using traditional RAID methods.

That's exactly what volume checking on standard HW controllers does as well: read all data and compare it with the parity. This is exactly the point why RAID6 should always be chosen over RAID5. In the event of a failed parity check, a RAID5 controller can only say "oops, I have found a problem but cannot correct it", since it does not know whether the parity or any of the n data blocks is wrong. In RAID6 you have redundant parity, so the controller can find out whether the parity was correct or not. At least I think that is true for Areca controllers :)

Cheers

Carsten
Bob Friesenhahn
2008-Dec-28 18:11 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
On Sun, 28 Dec 2008, Carsten Aulbert wrote:

>> ZFS checks data correctness (at the CPU) for each read, while HW RAID depends on the hardware detecting a problem; even if the data is ok when read from disk, it may be corrupted by the time it makes it to the CPU.
>
> AFAIK this is not done during normal operation (unless a disk that is asked for a sector cannot return it).

ZFS checksum-validates all returned data. Are you saying that this fact is incorrect?

> That's exactly what volume checking on standard HW controllers does as well: read all data and compare it with the parity.

What if the data was corrupted prior to parity generation?

> This is exactly the point why RAID6 should always be chosen over RAID5. In the event of a failed parity check, a RAID5 controller can only say "oops, I have found a problem but cannot correct it", since it does not know whether the parity or any of the n data blocks is wrong. In RAID6 you have redundant parity, so the controller can find out whether the parity was correct or not. At least I think that is true for Areca controllers :)

Good point. Luckily, ZFS's raidz does not have this problem, since it is able to tell whether the "corrected" data is actually correct (within the checksum computation's margin for error). If applying parity does not result in the correct checksum, then it knows that the data is toast.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
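The "reconstruct, then verify against the checksum" idea Bob mentions can be illustrated with a toy single-parity stripe. This is a sketch, not raidz code: plain XOR parity and SHA-256 stand in for the real on-disk layout.

    import hashlib
    from functools import reduce

    def xor(blocks):
        # Byte-wise XOR of equally sized blocks.
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    def read_stripe(data_blocks, parity, stored_checksum):
        def ok(blocks):
            return hashlib.sha256(b"".join(blocks)).digest() == stored_checksum

        if ok(data_blocks):
            return b"".join(data_blocks)
        # Checksum failed: assume each data block in turn is the bad one,
        # rebuild it from the others plus parity, and accept only a
        # candidate whose checksum matches.
        for i in range(len(data_blocks)):
            others = data_blocks[:i] + data_blocks[i + 1:]
            rebuilt = xor(others + [parity])
            candidate = data_blocks[:i] + [rebuilt] + data_blocks[i + 1:]
            if ok(candidate):
                return b"".join(candidate)   # we *know* this one is right
        raise IOError("more damage than the redundancy can repair: data is toast")

    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor(data)
    cksum = hashlib.sha256(b"".join(data)).digest()
    # One block silently corrupted: reconstruction is found and verified.
    assert read_stripe([b"AAAA", b"XXXX", b"CCCC"], parity, cksum) == b"AAAABBBBCCCC"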
Carsten Aulbert
2008-Dec-28 18:37 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi Bob,

Bob Friesenhahn wrote:

>> AFAIK this is not done during normal operation (unless a disk that is asked for a sector cannot return it).
>
> ZFS checksum-validates all returned data. Are you saying that this fact is incorrect?

No, sorry, too long in front of a computer today, I guess: I was referring to hardware RAID controllers. AFAIK these usually do not check the validity of data unless a disk returns an error. My understanding of ZFS is exactly that: data is checked in the CPU against the stored checksum.

>> That's exactly what volume checking on standard HW controllers does as well: read all data and compare it with the parity.
>
> What if the data was corrupted prior to parity generation?

Well, that is bad luck. The same is true if your ZFS box has faulty memory and the computed checksum is right for the data on disk, but wrong in the sense of the file under consideration.

Sorry for the confusion

Cheers

Carsten
This is good information, guys. Do we have some more facts and links about HW RAID and its data integrity, or lack thereof?
-- 
This message posted from opensolaris.org
Nice discussion. Let me chip in my old-timer view --

Until a few years ago, the understanding that "HW RAID doesn't proactively check for consistency of data vs. parity unless required" was true. But LSI has added a background consistency check (it auto-starts 5 minutes after the logical drive is created) on its RAID cards. Since Sun is primarily selling LSI HW RAID cards, I guess at that high level both HW RAID and ZFS provide some proactive consistency/integrity assurance.

HOWEVER, I really think the ZFS way is much more advanced (PiT integrated) and can be used with other ZFS ECC/EDC features with memory-based data consistency/integrity assurance, to achieve considerably better overall data availability and business continuity. I guess I just like the "enterprise flavor" as such. ;-)

Below are some tech details. -- Again, please do not compare HW RAID with ZFS at the feature level. RAID was invented for both data protection and performance, and there are different ways to do those with ZFS, resulting in very different solution architectures (depending on the customer segment; sometimes it can even be beneficial to use HW RAID, e.g. when heterogeneous HW RAID disks are deployed in a unified fashion and ZFS does not handle the enterprise-wide data protection...).

ZFS does automatic error correction even when using a single hard drive, including by using end-to-end checksumming, separating the checksum from the file, and using copy-on-write redundancy, so it is always both verifying the data and creating another copy (not overwriting) when writing a change to a file.

Sun Distinguished Engineer Bill Moore, who co-developed ZFS:

"... one of the design principles we set for ZFS was: never, ever trust the underlying hardware. As soon as an application generates data, we generate a checksum for the data while we're still in the same fault domain where the application generated the data, running on the same CPU and the same memory subsystem. Then we store the data and the checksum separately on disk so that a single failure cannot take them both out.

When we read the data back, we validate it against that checksum and see if it's indeed what we think we wrote out before. If it's not, we employ all sorts of recovery mechanisms. Because of that, we can, on very cheap hardware, provide more reliable storage than you could get with the most reliable external storage. It doesn't matter how perfect your storage is: if the data gets corrupted in flight - and we've actually seen many customer cases where this happens - then nothing you can do can recover from that. With ZFS, on the other hand, we can actually authenticate that we got the right answer back and, if not, enact a bunch of recovery scenarios. That's data integrity."

See more details about ZFS Data Integrity and Security.

Best,
z

----- Original Message ----- 
From: "Orvar Korvar" <knatte_fnatte_tjatte at yahoo.com>
To: <zfs-discuss at opensolaris.org>
Sent: Sunday, December 28, 2008 4:16 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> This is good information, guys. Do we have some more facts and links about HW RAID and its data integrity, or lack thereof?
> --
> This message posted from opensolaris.org
The hyperlinks didn't work, here are the URLs --

http://queue.acm.org/detail.cfm?id=1317400

http://www.sun.com/bigadmin/features/articles/zfs_part1.scalable.jsp#integrity

----- Original Message ----- 
From: "JZ" <jz at excelsioritsolutions.com>
To: "Orvar Korvar" <knatte_fnatte_tjatte at yahoo.com>; <zfs-discuss at opensolaris.org>
Sent: Sunday, December 28, 2008 7:50 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> Nice discussion. Let me chip in my old-timer view --
>
> Until a few years ago, the understanding that "HW RAID doesn't proactively check for consistency of data vs. parity unless required" was true. But LSI has added a background consistency check (it auto-starts 5 minutes after the logical drive is created) on its RAID cards. Since Sun is primarily selling LSI HW RAID cards, I guess at that high level both HW RAID and ZFS provide some proactive consistency/integrity assurance.
>
> [...]
>
> Best,
> z
BTW, the following text from another discussion may be helpful towards your concerns.

What to use for RAID is not a fixed answer, but using ZFS can be a good thing in many cases and for many reasons, such as the price/performance concern Bob highlighted. And note Bob said "client OSs". To me, that should read "host OSs", since again, I am an enterprise guy, and my ideal way of using ZFS may differ from most folks today.

To me, I would take ZFS for SAN-based virtualization and as a file/IP-block services gateway to applications (and file services to clients is one of the "enterprise applications" to me). For example, I would then use different implementations for CIFS and NFS serving, not using the ZFS native NAS support to clients, but the ZFS storage pooling and SAM-FS management features. (I would use ZFS in a 6920 fashion, if you don't know what I am talking about -- http://searchstorage.techtarget.com/news/article/0,289142,sid5_gci1245572,00.html)

Sorry, I don't want to lead the discussion into file systems and NFV, but to me, ZFS is very close to the WAFL design point, and the file system's involvement in RAID, PiT, HSM/ILM, application/data security/protection and HA/BC functions is vital. :-)

z
___________________________

I do agree that when multiple client OSs are involved it is still useful if storage looks like a legacy disk drive. Luckily, Solaris already offers iSCSI in Solaris 10, and OpenSolaris is now able to offer high-performance Fibre Channel target and Fibre Channel over Ethernet layers on top of reliable ZFS. The full benefit of ZFS is not provided, but the storage is successfully divorced from the client with a higher degree of data reliability and performance than is available from current firmware-based RAID arrays.

Bob
======================================
Bob Friesenhahn

----- Original Message ----- 
From: "JZ" <jz at excelsioritsolutions.com>
To: "Orvar Korvar" <knatte_fnatte_tjatte at yahoo.com>; <zfs-discuss at opensolaris.org>
Sent: Sunday, December 28, 2008 7:55 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> The hyperlinks didn't work, here are the URLs --
>
> http://queue.acm.org/detail.cfm?id=1317400
>
> http://www.sun.com/bigadmin/features/articles/zfs_part1.scalable.jsp#integrity
To answer the original post, simple answer: almost all old RAID designs have holes in their logic where they are insufficiently paranoid on writes or reads, and sometimes both. One example is the infamous RAID-5 write hole. Look at the simple example of mirrored SVM versus ZFS on pages 15 and 16 of this presentation: http://opensolaris.org/os/community/zfs/docs/zfs_last.pdf

Critical metadata is triplicated, and all metadata is at least duplicated, even on a single-disk configuration. Almost all other filesystems are kludges with insufficient paranoia by default, and only become sufficiently paranoid by twiddling knobs & adding things like EMC did. After using ZFS for a while there is no other filesystem as good. I haven't played with Linux BTRFS, though; maybe it has some good stuff, but last I heard it was still in alpha.
-- 
This message posted from opensolaris.org
Carsten Aulbert <carsten.aulbert <at> aei.mpg.de> writes:
>
> In RAID6 you have redundant parity, thus the controller can find out
> if the parity was correct or not. At least I think that to be true
> for Areca controllers :)

Are you sure about that? The latest research I know of [1] says that although an algorithm does exist to theoretically recover from single-disk corruption in the case of RAID-6, it is *not* possible to detect dual-disk corruption with 100% certainty. And blindly running the said algorithm in such a case would even introduce corruption on a third disk.

This is the reason why, AFAIK, no RAID-6 implementation actually attempts to recover from single-disk corruption (someone correct me if I am wrong). The exception is ZFS of course, but it accomplishes single and dual-disk corruption self-healing by using its own checksum, which is one layer above RAID-6 (therefore unrelated to it).

[1] http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

-marc
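For readers who want to see the single-corruption location trick from the paper in concrete form, here is a sketch for one byte column. It is illustrative Python under the paper's conventions (generator g = {02}, polynomial 0x11d); real implementations work on whole sectors and must also consider the case where P or Q itself is the corrupt block.

    # GF(2^8) log/exp tables for the RAID-6 field (polynomial 0x11d).
    EXP = [0] * 512
    LOG = [0] * 256
    x = 1
    for i in range(255):
        EXP[i] = x
        LOG[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11d
    for i in range(255, 512):
        EXP[i] = EXP[i - 255]

    def gmul(a, b):
        if a == 0 or b == 0:
            return 0
        return EXP[LOG[a] + LOG[b]]

    def pq(data):
        """P (plain XOR) and Q (weighted GF sum) for one byte column."""
        p = q = 0
        for i, d in enumerate(data):
            p ^= d
            q ^= gmul(EXP[i], d)      # EXP[i] == g**i
        return p, q

    def locate_single_corruption(data, p, q):
        """If exactly one data byte is silently wrong, return its drive index."""
        p2, q2 = pq(data)
        sp, sq = p ^ p2, q ^ q2           # syndromes
        if sp == 0 and sq == 0:
            return None                   # consistent (or an undetectable combination)
        if sp == 0 or sq == 0:
            return None                   # looks like corrupt Q or P, not a data drive
        z = (LOG[sq] - LOG[sp]) % 255     # Q_syn = g**z * P_syn, so z is the log ratio
        return z if z < len(data) else None

    data = [0x11, 0x22, 0x33, 0x44, 0x55]    # one byte from each of 5 data drives
    p, q = pq(data)
    bad = list(data)
    bad[3] ^= 0x5a                           # silent corruption on drive 3
    assert locate_single_corruption(bad, p, q) == 3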
Carsten Aulbert
2008-Dec-30 10:30 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi Marc,

Marc Bevand wrote:
> Carsten Aulbert <carsten.aulbert <at> aei.mpg.de> writes:
>> In RAID6 you have redundant parity, thus the controller can find out
>> if the parity was correct or not. At least I think that to be true
>> for Areca controllers :)
>
> Are you sure about that? The latest research I know of [1] says that although an algorithm does exist to theoretically recover from single-disk corruption in the case of RAID-6, it is *not* possible to detect dual-disk corruption with 100% certainty. And blindly running the said algorithm in such a case would even introduce corruption on a third disk.

Well, I probably need to wade through the paper (and recall Galois field theory) before answering this. We did a few tests in a 16-disk RAID6 where we wrote data to the RAID, powered the system down, pulled out one disk, inserted it into another computer and changed the sector checksum of a few sectors (using hdparm's --make-bad-sector option). Then we reinserted the disk into the original box, powered it up and ran a volume check, and the controller did indeed find the corrupted sectors and repair them correctly without destroying data on another disk (as far as we know and tested).

For the other point: dual-disk corruption can (to my understanding) never be healed by the controller, since there is no redundant information available to check against. I don't recall if we performed some tests on that part as well, but maybe we should, to learn how the controller behaves. As a matter of fact, at that point it should just start crying out loud and tell me that it cannot recover from that. But the chance of this happening should be relatively small unless the backplane/controller had a bad hiccup when writing that stripe.

> This is the reason why, AFAIK, no RAID-6 implementation actually attempts to recover from single-disk corruption (someone correct me if I am wrong).

As I said, I know that our Areca 1261ML does detect and correct those errors - if they are single-disk corruptions.

> The exception is ZFS of course, but it accomplishes single and dual-disk corruption self-healing by using its own checksum, which is one layer above RAID-6 (therefore unrelated to it).

Yes, very helpful and definitely desirable to have :)

> [1] http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

Thanks for the pointer

Cheers

Carsten
Carsten Aulbert <carsten.aulbert <at> aei.mpg.de> writes:
>
> Well, I probably need to wade through the paper (and recall Galois field theory) before answering this. We did a few tests in a 16-disk RAID6 where we wrote data to the RAID, powered the system down, pulled out one disk, inserted it into another computer and changed the sector checksum of a few sectors (using hdparm's --make-bad-sector option). Then we reinserted the disk into the original box, powered it up and ran a volume check, and the controller did indeed find the corrupted sectors and repair them correctly without destroying data on another disk (as far as we know and tested).

Note that there are cases of single-disk corruption that are trivially recoverable (for example if the corruption affects the P or Q parity block, as opposed to the data blocks). Maybe that's what you inadvertently tested? Overwrite a number of contiguous sectors spanning 3 stripes on a single disk to be sure to correctly stress-test the self-healing mechanism.

> For the other point: dual-disk corruption can (to my understanding) never be healed by the controller, since there is no redundant information available to check against. I don't recall if we performed some tests on that part as well, but maybe we should, to learn how the controller behaves. As a matter of fact, at that point it should just start crying out loud and tell me that it cannot recover from that.

The paper explains that the best RAID-6 can do is use probabilistic methods to distinguish between single and dual-disk corruption, e.g. "there is a 95% chance this is single-disk corruption, so I am going to fix it assuming that, but there is a 5% chance I am actually going to corrupt more data; I just can't tell". I wouldn't want to rely on a RAID controller that takes gambles :-)

-marc
Mattias Pantzare
2008-Dec-30 23:00 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
On Tue, Dec 30, 2008 at 11:30, Carsten Aulbert <carsten.aulbert at aei.mpg.de> wrote:
> Hi Marc,
>
> Marc Bevand wrote:
>> Are you sure about that? The latest research I know of [1] says that although an algorithm does exist to theoretically recover from single-disk corruption in the case of RAID-6, it is *not* possible to detect dual-disk corruption with 100% certainty. And blindly running the said algorithm in such a case would even introduce corruption on a third disk.
>
> Well, I probably need to wade through the paper (and recall Galois field theory) before answering this. We did a few tests in a 16-disk RAID6 where we wrote data to the RAID, powered the system down, pulled out one disk, inserted it into another computer and changed the sector checksum of a few sectors (using hdparm's --make-bad-sector option). Then we reinserted the disk into the original box, powered it up and ran a volume check, and the controller did indeed find the corrupted sectors and repair them correctly without destroying data on another disk (as far as we know and tested).

You are talking about different types of errors. You tested errors that the disk can detect. That is not a problem on any RAID; that is what it is designed for.

He was talking about errors that the disk can't detect (errors introduced by other parts of the system, writes to the wrong sector, or very bad luck). You can simulate that by writing different data to the sector.
Que? So what can we deduce about HW RAID? There are some controller cards that do background consistency checks? And error detection of various kinds?
-- 
This message posted from opensolaris.org
Orvar, did you see my post on consistency checks and data integrity? It does not matter what HW RAID has; the point is what HW RAID does not have...

Please, out of respect for Bill, please study; here is more.

THE LAST WORD IN FILE SYSTEMS
http://www.sun.com/software/solaris/zfs_lc_preso.pdf

Best,
z

----- Original Message ----- 
From: "Orvar Korvar" <knatte_fnatte_tjatte at yahoo.com>
To: <zfs-discuss at opensolaris.org>
Sent: Tuesday, December 30, 2008 8:21 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> Que? So what can we deduce about HW RAID? There are some controller cards that do background consistency checks? And error detection of various kinds?
> --
> This message posted from opensolaris.org
Mattias Pantzare <pantzare <at> gmail.com> writes:
>
> He was talking about errors that the disk can't detect (errors introduced by other parts of the system, writes to the wrong sector, or very bad luck). You can simulate that by writing different data to the sector.

Well, yes, you can. Carsten and I are both talking about silent data corruption errors, and the way to simulate them is to do what Carsten did. However, I pointed out that he may have tested only easy corruption cases (affecting the P or Q parity only) -- it is tricky to simulate hard-to-recover corruption errors...

-marc
I've studied all the links here. But I want information about the HW RAID controllers, not about ZFS, because I have plenty of ZFS information now. The closest thing I got was www.baarf.org. In one article there, the author states that "raid5 never does parity check on reads". I've written that to the Linux guys. And also "raid6 guesses when it tries to repair some errors, with a chance of corrupting more". Those are hard facts. Any more?
-- 
This message posted from opensolaris.org
Orvar Korvar wrote:
> I've studied all the links here. But I want information about the HW RAID controllers, not about ZFS, because I have plenty of ZFS information now. The closest thing I got was www.baarf.org.

[one of my favorite sites ;-)]

The problem is that there is no such thing as "hardware RAID"; there is only "software RAID." The "HW RAID" controllers are processors running software, and the features of the product are therefore limited by the software developer and processor capabilities. It goes without saying that the processors are very limited compared to the main system CPU found on modern machines. It also goes without saying that the software (or firmware, if you prefer) is closed. Good luck cracking that nut.

> In one article there, the author states that "raid5 never does parity check on reads". I've written that to the Linux guys. And also "raid6 guesses when it tries to repair some errors, with a chance of corrupting more". Those are hard facts.

The high-end RAID arrays have better, more expensive processors and a larger feature set. Some even add block-level checksumming, which has led to some fascinating studies on field failures. But I think it is safe to assume that those features will not exist on the low-end systems for some time.
-- richard
There is a company (DataCore Software) that has been making and shipping products for many years that I believe would help in this area. I've used them before; they're very solid and have been leveraging commodity server and disk hardware to build massive storage arrays (FC & iSCSI), one of the same things ZFS is working to do. I looked at some of the documentation for this topic of discussion, and this is what I found:

"CRC/Checksum Error Detection: In SANmelody and SANsymphony, enhanced error detection can be provided by enabling Cyclic Redundancy Check (CRC), a form of sophisticated redundancy check. When CRC/Checksum is enabled, the iSCSI driver adds a bit scheme to the iSCSI packet when it is transmitted. The iSCSI driver then verifies the bits in the packet when it is received to ensure data integrity. This error detection method provides a low probability of undetected errors compared to standard error checking performed by TCP/IP. The CRC bits may be added to either Data Digest, Header Digest, or both."

DataCore has been really good at implementing all the features of the 'high end' arrays at a 'low end' price point.

Dave

Richard Elling wrote:
> Orvar Korvar wrote:
>> I've studied all the links here. But I want information about the HW RAID controllers, not about ZFS, because I have plenty of ZFS information now. The closest thing I got was www.baarf.org.
>
> [one of my favorite sites ;-)]
>
> The problem is that there is no such thing as "hardware RAID"; there is only "software RAID." The "HW RAID" controllers are processors running software, and the features of the product are therefore limited by the software developer and processor capabilities. It goes without saying that the processors are very limited compared to the main system CPU found on modern machines. It also goes without saying that the software (or firmware, if you prefer) is closed. Good luck cracking that nut.
>
>> In one article there, the author states that "raid5 never does parity check on reads". I've written that to the Linux guys. And also "raid6 guesses when it tries to repair some errors, with a chance of corrupting more". Those are hard facts.
>
> The high-end RAID arrays have better, more expensive processors and a larger feature set. Some even add block-level checksumming, which has led to some fascinating studies on field failures. But I think it is safe to assume that those features will not exist on the low-end systems for some time.
> -- richard
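As a rough illustration of what such a digest buys you, the sketch below appends a checksum to a payload and verifies it on receipt; this protects the data in transit between initiator and target. Real iSCSI digests use CRC32C rather than the plain CRC32 used here, and the framing is invented for the example.

    import struct, zlib

    def append_digest(payload):
        # Append a 4-byte checksum to the payload, digest-style.
        return payload + struct.pack("<I", zlib.crc32(payload) & 0xffffffff)

    def check_digest(packet):
        payload, digest = packet[:-4], packet[-4:]
        return struct.unpack("<I", digest)[0] == (zlib.crc32(payload) & 0xffffffff)

    pkt = append_digest(b"SCSI data going over the wire")
    assert check_digest(pkt)                # clean transfer
    corrupted = b"X" + pkt[1:]              # a byte corrupted in flight
    assert not check_digest(corrupted)      # receiver detects it and can request a resend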
>>>>> "ca" == Carsten Aulbert <carsten.aulbert at aei.mpg.de> writes:
>>>>> "ok" == Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> writes:

    ca> (using hdparm's --make-bad-sector option)

I haven't used that before, but it sounds like what you did may give the RAID layer some extra information. If one of the disks reports ``read error---I have no idea what's stored in that sector,'' then RAID5/6 knows which disk is wrong because the disk confessed. If all the disks successfully return data, but one returns the wrong data, RAID5/6 has to determine the wrong disk by math, not by device driver error returns.

I don't think RAID6 reads whole stripes, so even if the dual parity has some theoretical/implemented ability to heal single-disk silent corruption, it'd do this healing only during some scrub-like procedure, not during normal reads. The benefit is better seek bandwidth than raidz. If the corruption is not silent (the disk returns an error), then it could use the hypothetical magical single-disk healing ability during normal reads too.

    ca> powered it up and ran a volume check and the controller did
    ca> indeed find the corrupted sector

sooo... (1) make your corrupt sector with dd rather than hdparm, like dd if=/dev/zero of=/dev/disk bs=512 count=1 seek=12345 conv=notrunc, and (2) check for the corrupt sector by reading the disk normally---either make sure the corrupt sector is inside a checksummed file like a tar or gz and use tar t or gzip -t, or use dd if=/dev/raidvol | md5sum before and after corrupting, something like that, NOT a ``volume check''. Make both changes 1 and 2 and I think the corruption will get through. Make only the first change but not the second, and you can look for this hypothetical math-based healing ability you're saying RAID6 has from having more parity than it needs for the situation.

    ok> "upon writing data to disk, ZFS reads it back, compares it to
    ok> the data in RAM and corrects it otherwise".

I don't think it does read-after-write. That'd be really slow.

The thing I don't like about the checksums is that they trigger for things other than bad disks, like if your machine loses power during a resilver, or other corner cases and bugs. I think the Netapp block-level RAID-layer checksums don't trigger for as many other reasons as the ZFS filesystem-level checksums, so chasing problems is easier. The good thing is that they are probably helping survive the corner cases and bugs, too.
>>>>> "db" == Dave Brown <dbrown at csolutions.net> writes:

    db> CRC/Checksum Error Detection: In SANmelody and SANsymphony,
    db> enhanced error detection can be provided by enabling Cyclic
    db> Redundancy Check (CRC) [...] The CRC bits may
    db> be added to either Data Digest, Header Digest, or both.

Thanks for the plug, but that sounds like an iSCSI feature, between storage controller and client, not between storage controller and disk. It sounds suspiciously like they're advertising something many vendors do without bragging, but I'm not sure. Anyway, we're talking about something different: writing to the disk in checksummed packets, so the storage controller can tell when the disk has silently returned bad data or another system has written to part of the disk, stuff like that---checksums to protect data as time passes, not as it travels through space.
On Wed, Dec 31, 2008 at 12:58 PM, Miles Nordin <carton at ivy.net> wrote:
> Thanks for the plug, but that sounds like an iSCSI feature, between storage controller and client, not between storage controller and disk. It sounds suspiciously like they're advertising something many vendors do without bragging, but I'm not sure. Anyway, we're talking about something different: writing to the disk in checksummed packets, so the storage controller can tell when the disk has silently returned bad data or another system has written to part of the disk, stuff like that---checksums to protect data as time passes, not as it travels through space.

The CRC checking is at least standard on QLogic hardware HBAs. I would imagine most vendors have it in their software stacks as well, since it's part of the iSCSI standard. It was more of a corner case for iSCSI to try to say "look, I'm as good as Fibre Channel" than anything else (IMO). Although that opinion may very well be inaccurate :)

--Tim
Happy new year! Snowing here and my new year party was cancelled. OK, let me do more boring IT stuff then.

Orvar, sorry I misunderstood you. Please feel free to explore the limitations of hardware RAID, and hopefully one day you will come to the conclusion that it was invented for saving CPU juice from disk management in order to better fulfill application needs, and that fundamental driver is weakening day by day. NetApp argued that with today's CPU power and server technologies, software RAID can be as efficient and even better if it is done right. And DataCore went beyond NetApp by enabling a software delivery to customers, instead of an integrated platform...

Anyway, if you are still into checking out HW RAID capabilities, I would suggest doing that in a categorized fashion. As you can see, there are many, many RAID cards at very, very different price points. It is not fair to make a statement that covers all of them. (And I could go to China tomorrow and burn any firmware into a RAID ASIC and challenge that statement...) Hence your request was a bit too difficult -- if you tell the list which HW RAID adapter you are focusing on, I am sure the list will knock that one off in no time. ;-)

http://www.ciao.com/Sun_StorageTek_SAS_RAID_Host_Bus_Adapter__15537063

Best,
z, bored

----- Original Message ----- 
From: Tim
To: Miles Nordin
Cc: zfs-discuss at opensolaris.org
Sent: Wednesday, December 31, 2008 3:20 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> The CRC checking is at least standard on QLogic hardware HBAs. I would imagine most vendors have it in their software stacks as well, since it's part of the iSCSI standard. It was more of a corner case for iSCSI to try to say "look, I'm as good as Fibre Channel" than anything else (IMO). Although that opinion may very well be inaccurate :)
>
> --Tim

"The problem is that there is no such thing as "hardware RAID"; there is only "software RAID." The "HW RAID" controllers are processors running software, and the features of the product are therefore limited by the software developer and processor capabilities. It goes without saying that the processors are very limited compared to the main system CPU found on modern machines. It also goes without saying that the software (or firmware, if you prefer) is closed. Good luck cracking that nut." -- Richard

Yes, thx! And beyond that, there are HW RAID adapters and HW RAID chips embedded into disk enclosures; they are all HW RAID ASICs with closed software, not very Open Storage. ;-)

best,
z
Mattias Pantzare <pantzare <at> gmail.com> writes:
> On Tue, Dec 30, 2008 at 11:30, Carsten Aulbert wrote:
> > [...]
> > where we wrote data to the RAID, powered the system down, pulled out one disk, inserted it into another computer and changed the sector checksum of a few sectors (using hdparm's --make-bad-sector option).
>
> You are talking about different types of errors. You tested errors that the disk can detect. That is not a problem on any RAID; that is what it is designed for.

Mattias pointed out to me in a private email that I missed Carsten's mention of hdparm --make-bad-sector. Duh!

So Carsten: Mattias is right, you did not simulate a silent data corruption error. hdparm --make-bad-sector just introduces a regular media error that *any* RAID level can detect and fix.

-marc
Nice... More on sector checksums --

* Anything prior to 2005 would be sort of out of date/out of fashion, because of this: http://www.patentstorm.us/patents/6952797/description.html

* The software RAID / NetApp view: http://pages.cs.wisc.edu/~krioukov/ParityLostAndParityRegained-FAST08.ppt

* The Linux view: http://www.nber.org/sys-admin/linux-nas-raid.html

best,
z

----- Original Message ----- 
From: "Marc Bevand" <m.bevand at gmail.com>
To: <zfs-discuss at opensolaris.org>
Sent: Thursday, January 01, 2009 6:40 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> Mattias pointed out to me in a private email that I missed Carsten's mention of hdparm --make-bad-sector. Duh!
>
> So Carsten: Mattias is right, you did not simulate a silent data corruption error. hdparm --make-bad-sector just introduces a regular media error that *any* RAID level can detect and fix.
>
> -marc
Carsten Aulbert
2009-Jan-02 07:06 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi Marc (and all the others),

Marc Bevand wrote:
> So Carsten: Mattias is right, you did not simulate a silent data corruption error. hdparm --make-bad-sector just introduces a regular media error that *any* RAID level can detect and fix.

OK, I'll need to go back to our tests performed months ago, but my feeling now is that we didn't do it right in the first place. It will take some time to retest that.

Cheers

Carsten
Hi Carsten,

Carsten Aulbert wrote:
> Hi Marc,
>
> Marc Bevand wrote:
>> Are you sure about that? The latest research I know of [1] says that although an algorithm does exist to theoretically recover from single-disk corruption in the case of RAID-6, it is *not* possible to detect dual-disk corruption with 100% certainty. And blindly running the said algorithm in such a case would even introduce corruption on a third disk.
>
> Well, I probably need to wade through the paper (and recall Galois field theory) before answering this. We did a few tests in a 16-disk RAID6 where we wrote data to the RAID, powered the system down, pulled out one disk, inserted it into another computer and changed the sector checksum of a few sectors (using hdparm's --make-bad-sector option).
> ...

You do not need to wade through the paper. ECC theory tells us that you need a minimum distance of 3 to correct one error in a codeword, so neither RAID-5 nor RAID-6 is enough: you would need RAID-2 (which nobody uses today).

RAID controllers today take advantage of the fact that they know which disk is returning the bad block, because that disk returns a read error.

ZFS is even able to correct data when an error in the data exists but no disk reports a read error, because ZFS ensures the integrity from the root block down to the data blocks with a long checksum accompanying the block pointers.

A disk can deliver bad data without returning a read error because of:
- a misdirected read (bad positioning of the disk head before reading)
- a previously misdirected write (on writing this sector)
- an unfortunate sector error (data wrong, but checksum is ok)

These events can happen and are documented on disk vendors' web pages:

a) A bad head positioning is estimated at one per 10^8 to 10^9 head moves.
   => this is more than once in 8 weeks on a fully loaded disk

b) An unrecoverable data error (bad data on disk) is around one sector per 10^16 Bytes read.
   => one unrecoverable error per 177 TByte read

OK, these numbers seem pretty good, but when you have 1000 disks in your datacenter, you will have at least one of these errors each day...

Therefore: use ZFS in a redundant configuration!

Regards,

Ulrich

-- 
| Ulrich Graef, Senior System Engineer, OS Ambassador
| Operating Systems, Performance, Platform Technology
| Mail: Ulrich.Graef at Sun.COM
| Phone: +49 6103 752 359
| Sun Microsystems Inc

Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering
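A quick back-of-envelope check of the fleet-level claim: only the per-TByte error rate is taken from the figures above; the per-disk daily workload is an assumption chosen for the example.

    import math

    disks = 1000
    bytes_read_per_disk_per_day = 2e11           # assume ~200 GB read per disk per day
    bytes_per_unrecoverable_error = 1.77e14      # "one error per 177 TByte read"

    expected_errors_per_day = disks * bytes_read_per_disk_per_day / bytes_per_unrecoverable_error
    p_at_least_one_today = 1 - math.exp(-expected_errors_per_day)   # Poisson approximation

    print("expected unrecoverable read errors per day: %.2f" % expected_errors_per_day)  # ~1.13
    print("chance of at least one such error today:    %.0f%%" % (100 * p_at_least_one_today))  # ~68%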
Ulrich Graef wrote:
> You do not need to wade through the paper. ECC theory tells us that you need a minimum distance of 3 to correct one error in a codeword, so neither RAID-5 nor RAID-6 is enough: you would need RAID-2 (which nobody uses today).
>
> RAID controllers today take advantage of the fact that they know which disk is returning the bad block, because that disk returns a read error.
>
> ZFS is even able to correct data when an error in the data exists but no disk reports a read error, because ZFS ensures the integrity from the root block down to the data blocks with a long checksum accompanying the block pointers.

The NetApp paper mentioned by JZ (http://pages.cs.wisc.edu/~krioukov/ParityLostAndParityRegained-FAST08.ppt) talks about write verify.

Would this feature make sense in a ZFS environment? I'm not sure there is any advantage. It seems quite unlikely, when data is written in a redundant way to two different disks, that both disks lose or misdirect the same writes.

Maybe ZFS could have an option to enable instant read-back of written blocks, if one wants to be absolutely sure data is written correctly to disk.
On Fri, Jan 2, 2009 at 10:47 AM, Mika Borner <opensolaris at bluewin.ch> wrote:
> The NetApp paper mentioned by JZ (http://pages.cs.wisc.edu/~krioukov/ParityLostAndParityRegained-FAST08.ppt) talks about write verify.
>
> Would this feature make sense in a ZFS environment? I'm not sure there is any advantage. It seems quite unlikely, when data is written in a redundant way to two different disks, that both disks lose or misdirect the same writes.
>
> Maybe ZFS could have an option to enable instant read-back of written blocks, if one wants to be absolutely sure data is written correctly to disk.

Seems to me it would make a LOT of sense in a WORM type system.

--Tim
Tim wrote:
>> The NetApp paper mentioned by JZ (http://pages.cs.wisc.edu/~krioukov/ParityLostAndParityRegained-FAST08.ppt) talks about write verify.
>>
>> Would this feature make sense in a ZFS environment? I'm not sure there is any advantage. It seems quite unlikely, when data is written in a redundant way to two different disks, that both disks lose or misdirect the same writes.
>>
>> Maybe ZFS could have an option to enable instant read-back of written blocks, if one wants to be absolutely sure data is written correctly to disk.
>
> Seems to me it would make a LOT of sense in a WORM type system.

Since ZFS only deals with block devices, how would we guarantee that the subsequent read was satisfied from the media rather than a cache? If the answer is that we just wait long enough for the caches to be emptied, then the existing scrub should work, no?
-- richard
Folks' feedback on my spam-like communications has been that I jump from point to point too fast, am too lazy to explain, and am often somewhat misleading. ;-)

On the NetApp thing, please note they had their time talking about how SW RAID can be as good as or better than HW RAID. However, from a customer point of view, the math is done in a reversed fashion. Roughly:

For a 3-9 (99.9%) availability, the customer has 8 hours of annual downtime, and RAID could help;

for a 4-9 (99.99%) availability, the customer has 45 minutes of annual downtime, and RAID alone won't do; H/A clustering may be needed (without clustering, a big iron box, such as an ES70000, can do 99.98%, but it is hard to reach 99.99%, in our past field studies);

for a 5-9 (99.999%) availability, the customer has 5 minutes of annual downtime, and H/A clustering with automated stateful failover is a must.

So, for every additional 9, the customer needs to learn additional pages in the NetApp price book, which I think is the real issue with NetApp (enterprise customers with the checkbooks may have absolutely no idea how RAID checksumming would impact their SLO/SLA costs).

I have not done a cost study on ZFS towards the 9999999s, but I guess we can do better with more system- and I/O-based assurance over just RAID checksumming, so customers can get to more 99998888s with less redundant hardware and fewer software feature enablement fees.

Also note that the upcoming NetApp ONTAP/GX converged release would hopefully improve the NetApp solution cost structure at some level, but I cannot discuss that until it's officially released [beyond keep screaming "6920+ZFS"]. ;-)

best,
z

----- Original Message ----- 
From: "Richard Elling" <Richard.Elling at Sun.COM>
To: "Tim" <tim at tcsac.net>
Cc: <zfs-discuss at opensolaris.org>; "Ulrich Graef" <Ulrich.Graef at Sun.COM>
Sent: Friday, January 02, 2009 2:35 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> Since ZFS only deals with block devices, how would we guarantee that the subsequent read was satisfied from the media rather than a cache? If the answer is that we just wait long enough for the caches to be emptied, then the existing scrub should work, no?
> -- richard
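For reference, the downtime budgets behind those SLA tiers follow directly from the availability definitions (they come out a little above the rounded figures quoted above):

    # Annual downtime allowed by an availability target ("the nines").
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
        downtime_min = (1 - availability) * MINUTES_PER_YEAR
        print("%d nines (%.3f%%): %7.1f minutes/year" % (nines, availability * 100, downtime_min))

    # 3 nines:  ~525.6 minutes/year (about 8.8 hours)
    # 4 nines:   ~52.6 minutes/year
    # 5 nines:    ~5.3 minutes/year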
Bob Friesenhahn
2009-Jan-03 01:21 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
On Fri, 2 Jan 2009, JZ wrote:
> I have not done a cost study on ZFS towards the 9999999s, but I guess we can do better with more system- and I/O-based assurance over just RAID checksumming, so customers can get to more 99998888s with less redundant hardware and fewer software feature enablement fees.

Even with a fairly trivial ZFS setup using hot-swap drive bays, the primary factors impacting "availability" are non-disk-related factors such as the motherboard, interface cards, and operating system bugs. Unless you step up to an exotic fault-tolerant system ($$$), an entry-level server will offer as much availability as a mid-range server, and many "enterprise" servers. In fact, the simple entry-level server may offer more availability due to being simpler. The charts on Richard Elling's blog make that pretty obvious.

It is best not to confuse "data integrity" with "availability".

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Yes, agreed.

However, for enterprises with risk management as a key factor built into their decision-making processes -- what if the integrity risk is reflected on Joe Tucci's personal network data? OMG, big impact to the SLA when the SLA is critical... [right, Tim?]

;-)
-z

----- Original Message ----- 
From: "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us>
To: "JZ" <jz at excelsioritsolutions.com>
Cc: <zfs-discuss at opensolaris.org>
Sent: Friday, January 02, 2009 8:21 PM
Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

> Even with a fairly trivial ZFS setup using hot-swap drive bays, the primary factors impacting "availability" are non-disk-related factors such as the motherboard, interface cards, and operating system bugs. Unless you step up to an exotic fault-tolerant system ($$$), an entry-level server will offer as much availability as a mid-range server, and many "enterprise" servers. In fact, the simple entry-level server may offer more availability due to being simpler. The charts on Richard Elling's blog make that pretty obvious.
>
> It is best not to confuse "data integrity" with "availability".
>
> Bob
On second thought, let me further explain why I had the Linux link in the same post. That was written a while ago, but I think the situation for cheap RAID cards has not changed much, though the RAID ASICs in RAID enclosures are getting more and more robust, just not "open".

If you take risk management into consideration, that range of chance is just too much to take when the data demand is not only for data access, but also for access to the correct data. We are talking about 0.01% of defined downtime headroom for a 4-9 SLA (which may be defined as "accessing the correct data"). You and I can wait half a day for network failures and the world turns just as fine, but not Joe Tucci. Not to mention the additional solution ($$$) that must be implemented to handle possible user operational errors for high-risk users. [Still not for you and me, for the business case of this solution; not so sure about Mr. Tucci, though. ;-)]

best,
z

http://www.nber.org:80/sys-admin/linux-nas-raid.html

"Let's repeat the reliability calculation with our new knowledge of the situation. In our experience perhaps half of drives have at least one unreadable sector in the first year. Again assume a 6 percent chance of a single failure. The chance of at least one of the remaining two drives having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is about 4.5%/year, which is .5% MORE than the 4% failure rate one would expect from a two-drive RAID 0 with the same capacity. Alternatively, if you just had two drives with a partition on each and no RAID of any kind, the chance of a failure would still be 4%/year but with only half the data loss per incident, which is considerably better than the RAID 5 can even hope for under the current reconstruction policy, even with the most expensive hardware.

We don't know what the reconstruction policy is for other RAID controllers, drivers or NAS devices. None of the boxes we bought acknowledged this "gotcha", but none promised to avoid it either. We assume Netapp and ECCS have this under control, since we have had several single-drive failures on those devices with no difficulty resyncing. We have not had a single drive failure yet in the MVD-based boxes, so we really don't know what they will do. [Since that was written we have had such failures, and they were able to reconstruct the failed drive, but we don't know if they could always do so.]

Some mitigation of the danger is possible. You could read and write the entire drive surface periodically, and replace any drives with even a single uncorrectable block visible. A daemon, smartd, is available for Linux that will scan the disk in the background for errors and report them. We had been running that, but ignored errors on unwritten sectors, because we were used to such errors disappearing when the sector was written (and the bad sector remapped).

Our current inclination is to shift to a recent 3ware controller, which we understand has a "continue on error" rebuild policy available as an option in the array setup. But we would really like to know more about just what that means. What do the apparently similar RAID controllers from Mylex, LSI Logic and Adaptec do about this? A look at their web sites reveals no information."
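The arithmetic in the quoted passage can be reproduced directly from its own assumed rates:

    # Reproduce the quoted back-of-envelope numbers (rates are the article's
    # assumptions, not drive datasheet figures).
    p_single_drive_failure = 0.06      # chance of one drive failing in a year (3-drive RAID 5)
    p_bad_sector_per_drive = 0.50      # chance a drive has >= 1 unreadable sector in year one

    # During a rebuild, the array is lost if either surviving drive hits a bad sector.
    p_rebuild_hits_bad_sector = 1 - (1 - p_bad_sector_per_drive) ** 2
    print(p_rebuild_hits_bad_sector)                              # 0.75

    p_raid5_data_loss = p_single_drive_failure * p_rebuild_hits_bad_sector
    print(p_raid5_data_loss)                                      # 0.045, i.e. ~4.5% per year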
Bob Friesenhahn
2009-Jan-03 03:31 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
On Fri, 2 Jan 2009, JZ wrote:> We are talking about 0.01% of defined downtime headroom for a 4-9 SLA (that > may be defined as "accessing the correct data").It seems that some people spend a lot of time analyzing their own hairy navel and think that it must surely be the center of the universe due to its resemblance to a black hole. If you just turn the darn computer off, then you can claim that the reason for the loss of availability is the computer and not the fault of data storage. A MTTDL of 10^14 or 10^16 is much larger than any number of trailing 9s that people like to dream about. Even the pyramids are crumbling now.> You and me can wait half a day for network failures and the world can turn > just as fine, but not for Joe Tucci.This guy only knows how to count beans. Bob ===================================== Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
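Since "trailing 9s" keep coming up in this thread, here is a small C sketch that converts availability nines into allowed downtime per year. The 365.25-day year and the per-year framing are assumptions; real SLAs define their own measurement windows and exclusions.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double minutes_per_year = 365.25 * 24.0 * 60.0;

    for (int nines = 2; nines <= 6; nines++) {
        double availability = 1.0 - pow(10.0, -(double)nines); /* e.g. 0.9999 for four nines */
        double downtime_min = (1.0 - availability) * minutes_per_year;
        printf("%d nines (%.4f%%): about %.1f minutes of downtime per year\n",
               nines, availability * 100.0, downtime_min);
    }
    /* Four nines works out to roughly 52.6 minutes per year, i.e. the
       0.01% downtime headroom mentioned earlier in the thread. */
    return 0;
}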
haha, this makes a cheerful new year start -- this kind of humor is only available at open storage. BTW, I did not know the pyramids are crumbling now, since they were built with love. But the Great Wall was crumbling, since it was built with hate (until we fixed part of that for tourist $$$). (if this is a text-only email, see attached pic, hopefully the Solaris mail server does that conversion automatically like Lotus.) -z ----- Original Message ----- From: "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> To: "JZ" <jz at excelsioritsolutions.com> Cc: <zfs-discuss at opensolaris.org> Sent: Friday, January 02, 2009 10:31 PM Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?> On Fri, 2 Jan 2009, JZ wrote: >> We are talking about 0.01% of defined downtime headroom for a 4-9 SLA >> (that may be defined as "accessing the correct data"). > > It seems that some people spend a lot of time analyzing their own hairy > navel and think that it must surely be the center of the universe due > to its resemblance to a black hole. > > If you just turn the darn computer off, then you can claim that the reason > for the loss of availability is the computer and not the fault of data > storage. A MTTDL of 10^14 or 10^16 is much larger than any number of > trailing 9s that people like to dream about. Even the pyramids are > crumbling now. > >> You and me can wait half a day for network failures and the world can >> turn just as fine, but not for Joe Tucci. > > This guy only knows how to count beans. > > Bob > =====================================> Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >
A Darren Dunham
2009-Jan-04 02:33 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
On Wed, Dec 31, 2008 at 01:53:03PM -0500, Miles Nordin wrote:> The thing I don't like about the checksums is that they trigger for > things other than bad disks, like if your machine loses power during a > resilver, or other corner cases and bugs. I think the Netapp > block-level RAID-layer checksums don't trigger for as many other > reasons as the ZFS filesystem-level checksums, so chasing problems is > easier.Why does losing power during a resilver cause any issues for the checksums in ZFS? Admittedly, bugs can always cause problems, but that's true for any software. I'm not sure that I see a reason that the integrated checksums and the separate checksums are more or less prone to bugs. Under what situations would you expect any differences between the ZFS checksums and the Netapp checksums to appear? I have no evidence, but I suspect the only difference (modulo any bugs) is how the software handles checksum failures. -- Darren
"ECC theory tells, that you need a minimum distance of 3 to correct one error in a codeword, ergo neither RAID-5 or RAID-6 are enough: you need RAID-2 (which nobody uses today)." What is "RAID-2"? Is it raidz2? -- This message posted from opensolaris.org
On Sun, Jan 4, 2009 at 5:47 PM, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:> "ECC theory tells, that you need a minimum distance of 3 > to correct one error in a codeword, ergo neither RAID-5 or RAID-6 > are enough: you need RAID-2 (which nobody uses today)." > > What is "RAID-2"? Is it raidz2? > -- >Google is your friend ;) http://www.pcguide.com/ref/hdd/perf/raid/levels/singleLevel2-c.html --Tim
A Darren Dunham
2009-Jan-05 07:42 UTC
[zfs-discuss] ZFS vs HardWare raid - data integrity?
On Sat, Jan 03, 2009 at 09:58:37PM -0500, JZ wrote:>> Under what situations would you expect any differences between the ZFS >> checksums and the Netapp checksums to appear? >> >> I have no evidence, but I suspect the only difference (modulo any bugs) >> is how the software handles checksum failures.> As I said, some NetApp folks I won't attack. > http://andyleonard.com/2008/03/05/on-parity-lost/ > > And some I don't really care. RAID-DP was cool (but CPQ ProLiant RAID ADG > was there for a long time too...). > http://partners.netapp.com/go/techontap/matl/sample/0206tot_resiliency.html > > And then some I have no idea who said what... > http://www.feedage.com/feeds/1625300/comments-on-unanswered-questions-about-netapp >Those documents discuss the Netapp checksum methods, but I don't get from them under what situations you would expect the ZFS and Netapp techniques would provide different levels of validation (or what conditions would cause one to fail but not the other). -- Darren
Darren, we have spent much time on this topic. I have provided enough NetApp docs to you and you seem to have studied them. So please study the ZFS docs available at Sun. Any thoughts you need folks to validate, please post. But the list does not do the thinking for you. The ways of implementing technologies are almost infinite and things can go wrong in different ways for different causes. And additional elements in the solution will impact the overall chance of things going wrong. That's as much as I can say, based on your question. Goodnight! z ----- Original Message ----- From: "A Darren Dunham" <ddunham at taos.com> To: <zfs-discuss at opensolaris.org> Sent: Monday, January 05, 2009 2:42 AM Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?> On Sat, Jan 03, 2009 at 09:58:37PM -0500, JZ wrote: >>> Under what situations would you expect any differences between the ZFS >>> checksums and the Netapp checksums to appear? >>> >>> I have no evidence, but I suspect the only difference (modulo any bugs) >>> is how the software handles checksum failures. > >> As I said, some NetApp folks I won't attack. >> http://andyleonard.com/2008/03/05/on-parity-lost/ >> >> And some I don't really care. RAID-DP was cool (but CPQ ProLiant RAID ADG >> was there for a long time too...). >> http://partners.netapp.com/go/techontap/matl/sample/0206tot_resiliency.html >> >> And then some I have no idea who said what... >> http://www.feedage.com/feeds/1625300/comments-on-unanswered-questions-about-netapp >> > > Those documents discuss the Netapp checksum methods, but I don't get > from them under what situations you would expect the ZFS and Netapp > techniques would provide different levels of validation (or what > conditions would cause one to fail but not the other). > > -- > Darren > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
For SCSI disks (including FC), you would use the FUA bit on the read command. For SATA disks ... does anyone care? ;-) -- This message posted from opensolaris.org
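For anyone curious what "use the FUA bit on the read command" looks like from the host side, below is a rough sketch using the Linux sg pass-through (an assumption -- on Solaris the rough equivalent would be the uscsi(7I) ioctl, and the structure differs). It issues a READ(10) for one block at LBA 0 with the FUA bit set, so the drive has to return data from the medium rather than from its cache. The /dev/sgN device path and the 512-byte logical block size are also assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char **argv)
{
    unsigned char cdb[10]  = { 0 };
    unsigned char data[512];            /* assumes a 512-byte logical block */
    unsigned char sense[32];
    struct sg_io_hdr io;

    if (argc < 2) {
        fprintf(stderr, "usage: %s /dev/sgN\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    cdb[0] = 0x28;                      /* READ(10) opcode */
    cdb[1] = 0x08;                      /* FUA bit: read from the medium, not the cache */
    /* LBA 0 sits in bytes 2..5 (already zero); transfer length in bytes 7..8 */
    cdb[8] = 1;                         /* one block */

    memset(&io, 0, sizeof(io));
    io.interface_id    = 'S';
    io.cmd_len         = sizeof(cdb);
    io.cmdp            = cdb;
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.dxferp          = data;
    io.dxfer_len       = sizeof(data);
    io.sbp             = sense;
    io.mx_sb_len       = sizeof(sense);
    io.timeout         = 20000;         /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); close(fd); return 1; }
    printf("SCSI status 0x%x, host status 0x%x, driver status 0x%x\n",
           io.status, io.host_status, io.driver_status);
    close(fd);
    return 0;
}

Whether a given SATA drive behind a translation layer honors FUA at all is a separate question, which may be the point of the wink above.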
[ok, no one replying, my spam then...] Open folks just care about SMART so far. http://www.mail-archive.com/linux-scsi at vger.kernel.org/msg07346.html Enterprise folks care more about spin-down. (not an open thing yet, unless new practical industry standard is here that I don't know. yeah right.) best, z ----- Original Message ----- From: "Anton B. Rang" <rang at acm.org> To: <zfs-discuss at opensolaris.org> Sent: Tuesday, January 06, 2009 9:07 AM Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?> For SCSI disks (including FC), you would use the FUA bit on the read > command. > > For SATA disks ... does anyone care? ;-) > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Ok, folks, new news - [feel free to comment in any fashion, since I don't know how yet.] EMC ACQUIRES OPEN-SOURCE ASSETS FROM SOURCELABS http://go.techtarget.com/r/5490612/6109175
Folks, I have had much fun and caused much trouble. I hope we now have learned the "open" spirit of storage. I will be less involved with the list discussion going forward, since I, too, have much work to do in my super domain. [but I still have lunch hours, so be good!] As I always say, thank you very much for the love and the tolerance! Take good care of your baby data in 2009, open folks! Best, zStorageAnalyst
Thank you. How does raidz2 compare to raid-2? Safer? Less safe? -- This message posted from opensolaris.org
RAID 2 is something weird that no one uses, and really only exists on paper as part of Berkeley's original RAID paper, IIRC. raidz2 is more or less RAID 6, just like raidz is more or less RAID 5. With raidz2, you have to lose 3 drives per vdev before data loss occurs. Scott On Thu, Jan 8, 2009 at 7:01 AM, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:> Thank you. How does raidz2 compare to raid-2? Safer? Less safe? > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
On Thu, Jan 8, 2009 at 10:01, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:> Thank you. How does raidz2 compare to raid-2? Safer? Less safe?Raid-2 is much less used, for one; it uses many more disks for parity, for two; and it is much slower in any application I can think of. Suppose you have 11 100G disks. Raid-2 would use 7 for data and 4 for parity, total capacity 700G, and would be able to recover from any single bit flips per data row (e.g., if any disk were lost or corrupted (!), it could recover its contents). This is not done using checksums, but rather ECC. One could implement checksums on top of this, I suppose. A major downside of raid-2 is that "efficient" use of space only happens when the raid groups are of size 2**k-1 for some integer k; this is because the Hamming code includes parity bits at certain intervals (see [1]). Raidz2, on the other hand, would take your 11 100G disks and use 9 for data and 2 for parity, and put checksums on blocks. This means that recovering any two corrupt or missing disks (as opposed to one with raid-2) is possible; with any two pieces of a block potentially damaged, one can calculate all the possibilities for what the block could have been before damage and accept the one whose calculated checksum matches the stored one. Thus, raidz2 is safer and more storage-efficient than raid-2. This is all mostly academic, as nobody uses raid-2. It's only as safe as raidz (can repair one error, or detect two) and space efficiency for normal-sized arrays is fairly atrocious. Use raidz{,2} and forget about it. Will [1]: http://en.wikipedia.org/wiki/Hamming_code#General_algorithm
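To put numbers on the space comparison in the post above, here is a small C sketch that computes how many Hamming check disks a RAID-2 style layout would need out of the 11-disk example, versus raidz2's fixed two parity disks per vdev. The 2^r >= d + r + 1 bound is the standard Hamming-code relation; treating whole disks as code symbols is a simplification, and the "G" figures just follow the 100G-per-disk example.

#include <stdio.h>

/* Smallest r with 2^r >= d + r + 1: the usual Hamming-code bound on
   the number of check symbols for d data symbols. */
static int hamming_check_disks(int data_disks)
{
    int r = 1;
    while ((1 << r) < data_disks + r + 1)
        r++;
    return r;
}

int main(void)
{
    int total = 11;                 /* the 11 x 100G example from the post */

    /* RAID-2: largest data count whose data + check disks fit in 11. */
    for (int data = total; data > 0; data--) {
        if (data + hamming_check_disks(data) <= total) {
            printf("RAID-2 : %d data + %d check disks -> %d00G usable\n",
                   data, hamming_check_disks(data), data);
            break;
        }
    }

    /* raidz2: two parity disks per vdev regardless of width. */
    printf("raidz2 : %d data + 2 parity disks -> %d00G usable\n",
           total - 2, total - 2);
    return 0;
}

This prints 7 data + 4 check disks for RAID-2 and 9 data + 2 parity for raidz2, matching the 700G versus 900G figures above.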
[just for the beloved Orvar] Ok, rule of thumb to save you some open time -- anything with "z", or "j", would probably be safe enough for your baby data. And yeah, I manage my own lunch hours BTW. :-) best, z ----- Original Message ----- From: "Orvar Korvar" <knatte_fnatte_tjatte at yahoo.com> To: <zfs-discuss at opensolaris.org> Sent: Thursday, January 08, 2009 10:01 AM Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?> Thank you. How does raidz2 compare to raid-2? Safer? Less safe? > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Got some more information about HW raid vs ZFS: http://www.opensolaris.org/jive/thread.jspa?messageID=326654#326654 -- This message posted from opensolaris.org
Ok, Orvar, that's why I always liked you. You really want to get to the point, otherwise you won't give up. So this is all about the Windows thing huh? Yes, I love MS Storage because we shared, and still share, a common dream. Yes, I love King and High because they are not arrogant if you respect them. And Yes, I would love to see OpenSolaris and Windows become the best alternatives, not Linux and Windows. So now we can be all happy and do IT? Fulfilling? best, z ----- Original Message ----- From: "Orvar Korvar" <knatte_fnatte_tjatte at yahoo.com> To: <zfs-discuss at opensolaris.org> Sent: Tuesday, January 13, 2009 3:46 PM Subject: Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?> Got some more information about HW raid vs ZFS: > http://www.opensolaris.org/jive/thread.jspa?messageID=326654#326654 > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
???????? no one is working tonight? where are the discussions? ok, I will not be picking on Orvar all the time, if that's why... the Windows statements were heavy, but hey, I am at home, not at work, it was just because Orvar was suffering. folks, are we not going to do IT just because I played with Orvar? Ok, this is it for me tonight, no more spam. Happy? Goodnight z
Still not happy? I guess I will have to do more spam myself -- So, I have to explain why I didn't like Linux but I like MS and OpenSolaris? I don't have any religious love for MS or Sun. Just that I believe talents are best utilized in an organized and systematic fashion, to benefit the whole. Leadership and talent management are important in any organization. Linux has a community. A community is not an organization, but can be guided by an organization. OpenSolaris is guided by Sun, by folks I trust that will lead in a constructive fashion. And MS Storage, they had the courage years ago, to say, we can do storage, datacenter storage, just as well as we do desktop, because we can learn, and we can dream! And they strived, in an organized and systematic fashion, under some constructive leadership. Folks, please, I don't know why there has been no post to the list. I would be very arrogant if I thought I could cause the silence in technology discussions. Please. Best, z
Ok, so someone is doing IT and has questions. Thank you! [I did not post this using another name, because I am too honorable to do that.] This is a list discussion; it should not be paused for one voice. best, z [If Orvar has other questions that I have not addressed, please ask me off-list. It's ok.]
Folks, what can I post to the list to make the discussion go on? Is this what you folks want to see? Something I shared with King and High but not with you folks? http://www.excelsioritsolutions.com/jz/jzbrush/jzbrush.htm This is not even IT stuff, so I never thought I should post this to the list... This is getting too strange for an open discussion. Please, folks. Best, z