Orvar's post over in opensol-discuss has me thinking:

After reading the paper and looking at the design docs, I'm wondering if there is some facility to allow for comparing data in the ARC to its corresponding checksum. That is, if I've got the data I want in the ARC, how can I be sure it's correct (and free of hardware memory errors)? I'd assume the way is to also store absolutely all the checksums for all blocks/metadata being read/written in the ARC (which, of course, means that only so much RAM corruption can be compensated for), and to do a validation every time that block is read from or written out of the ARC. You'd likely have to do constant metadata consistency checking, and likely have to hold multiple copies of metadata in-ARC to compensate for possible corruption. I'm assuming that this has at least been explored, right?

(The researchers used non-ECC RAM, so honestly, I think it's a bit unrealistic to expect that your car will win the Indy 500 if you put a Yugo engine in it.) Normally, this problem is exactly what you have hardware ECC and memory scrubbing for at the hardware level.

I'm not saying that ZFS should consider doing this - doing a validation for in-memory data is non-trivially expensive in performance terms, and there's only so much you can do and still expect your machine to survive. I mean, I've used the old NonStop stuff, and yes, you can shoot them with a .45 and it likely will still run, but whacking them with a bazooka is still guaranteed to make them, well, Non-NonStop.

-Erik

-------- Original Message --------
Subject: Re: [osol-discuss] Any news about 2010.3?
Date: Wed, 31 Mar 2010 01:06:45 PDT
From: Orvar Korvar <knatte_fnatte_tjatte at yahoo.com>
To: opensolaris-discuss at opensolaris.org

If you value your data, you should reconsider. But if your data is not important, then skip ZFS.

File system data corruption test by researchers:
http://blogs.zdnet.com/storage/?p=169

ZFS data corruption test by researchers:
http://www.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
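To make the question concrete, here is a rough sketch of what "keep the checksum next to the cached block and validate it on every access" would mean. This is not ZFS code; the struct, the function names, and the trivial Fletcher-style checksum are all made up purely for illustration:

/*
 * Illustration only -- not ZFS code.  Invented names, toy 64-bit
 * Fletcher-style checksum.  The point is just to show what "check the
 * cached buffer against its stored checksum on every access" means.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef struct cached_buf {
    void        *data;      /* the cached block */
    size_t      size;
    uint64_t    cksum;      /* checksum stored alongside it */
} cached_buf_t;

static uint64_t
toy_fletcher(const void *buf, size_t size)
{
    const uint8_t *p = buf;
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < size; i++) {
        a += p[i];
        b += a;
    }
    return ((b << 32) | (a & 0xffffffffULL));
}

/* Called on every read from the cache; returns 0 if the copy still matches. */
static int
cached_buf_access(const cached_buf_t *cb, void *dst)
{
    if (toy_fletcher(cb->data, cb->size) != cb->cksum)
        return (-1);        /* in-memory corruption detected */
    memcpy(dst, cb->data, cb->size);
    return (0);
}

int
main(void)
{
    char block[512] = "some cached file data";
    cached_buf_t cb = { block, sizeof (block), 0 };
    char out[512];

    cb.cksum = toy_fletcher(block, sizeof (block));  /* stored at insert time */

    printf("clean access: %d\n", cached_buf_access(&cb, out));
    block[100] ^= 0x01;                              /* simulate a bit flip */
    printf("after bit flip: %d\n", cached_buf_access(&cb, out));
    return (0);
}

Detecting the flip is conceptually the easy part; the rest of the thread is really about whether an extra pass over the buffer on every access is an acceptable price.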
> I'm not saying that ZFS should consider doing this - doing a validation
> for in-memory data is non-trivially expensive in performance terms, and
> there's only so much you can do and still expect your machine to
> survive. I mean, I've used the old NonStop stuff, and yes, you can
> shoot them with a .45 and it likely will still run, but whacking them
> with a bazooka is still guaranteed to make them, well, Non-NonStop.

If we scrub the memory anyway, why not include a check of the ZFS checksums which are already in memory?

OTOH, ZFS gets a lot of mileage out of cheap hardware, and we know what the limitations are when you don't use ECC; the industry must start to require that all chipsets support ECC.

Casper
Casper.Dik at Sun.COM wrote:
>> I'm not saying that ZFS should consider doing this - doing a validation
>> for in-memory data is non-trivially expensive in performance terms, and
>> there's only so much you can do and still expect your machine to
>> survive. [...]
>
> If we scrub the memory anyway, why not include a check of the ZFS
> checksums which are already in memory?
>
> OTOH, ZFS gets a lot of mileage out of cheap hardware, and we know what
> the limitations are when you don't use ECC; the industry must start to
> require that all chipsets support ECC.
>
> Casper

Reading the paper was interesting, as it highlighted all the places where ZFS "skips" validation. There are a lot of them. In many ways, fixing this would likely make ZFS similar to AppleTalk, whose notorious performance (relative to Ethernet) was caused by what many called the "Are You Sure?" design. Double- and triple-checking absolutely everything has its costs.

And, yes, we really should just force computer manufacturers to use ECC in more places (not just RAM) - as densities and data volumes increase, we are more likely to see errors, and without proper hardware checking, we're really going out on a limb in trusting what the hardware says. And, let's face it - hardware error correction is /so/ much faster than doing it in software.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
ECC-enabled RAM should become very cheap quickly if the industry embraces it in every computer. :-)

best regards,
hanzhu

On Wed, Mar 31, 2010 at 5:46 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Reading the paper was interesting, as it highlighted all the places where
> ZFS "skips" validation. [...]
>
> And, yes, we really should just force computer manufacturers to use ECC in
> more places (not just RAM) - as densities and data volumes increase, we
> are more likely to see errors, and without proper hardware checking, we're
> really going out on a limb in trusting what the hardware says. And, let's
> face it - hardware error correction is /so/ much faster than doing it in
> software.
>
> --
> Erik Trimble
> Java System Support
On 31/03/2010 10:27, Erik Trimble wrote:
> Orvar's post over in opensol-discuss has me thinking:
>
> After reading the paper and looking at the design docs, I'm wondering if
> there is some facility to allow for comparing data in the ARC to its
> corresponding checksum. That is, if I've got the data I want in the ARC,
> how can I be sure it's correct (and free of hardware memory errors)? I'd
> assume the way is to also store absolutely all the checksums for all
> blocks/metadata being read/written in the ARC (which, of course, means
> that only so much RAM corruption can be compensated for), and to do a
> validation every time that block is read from or written out of the ARC.
> You'd likely have to do constant metadata consistency checking, and
> likely have to hold multiple copies of metadata in-ARC to compensate for
> possible corruption. I'm assuming that this has at least been explored,
> right?

A subset of this is already done. The ARC keeps its own in-memory checksum (because some buffers in the ARC are not yet on stable storage, so they don't have a block pointer checksum yet).

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

arc_buf_freeze()
arc_buf_thaw()
arc_cksum_verify()
arc_cksum_compute()

It isn't done on every access, but it can detect in-memory corruption - I've seen it happen on several occasions, though all due to errors in my code, not bad physical memory.

Doing it more frequently could cause a significant performance problem.

-- 
Darren J Moffat
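For readers who haven't opened arc.c, the shape of the freeze/thaw idea Darren describes is roughly as follows. This is a heavily simplified sketch with invented names and a toy checksum, not the real implementation: a buffer is "thawed" while it is being modified (its checksum is meaningless then), and "frozen" once it is stable, at which point a checksum is computed that a later verify can compare against.

/*
 * Simplified illustration of the freeze/thaw idea -- not the real arc.c
 * code, and the names are invented.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct toy_buf {
    uint8_t     data[4096];
    uint64_t    cksum;
    int         frozen;     /* is cksum currently valid? */
} toy_buf_t;

static uint64_t
toy_cksum(const uint8_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) {
        a += p[i];
        b += a;
    }
    return ((b << 32) ^ a);
}

static void
toy_buf_thaw(toy_buf_t *tb)     /* about to modify: checksum no longer valid */
{
    tb->frozen = 0;
}

static void
toy_buf_freeze(toy_buf_t *tb)   /* contents now stable: remember checksum */
{
    tb->cksum = toy_cksum(tb->data, sizeof (tb->data));
    tb->frozen = 1;
}

static int
toy_buf_verify(const toy_buf_t *tb)     /* e.g. before evict or write-out */
{
    if (!tb->frozen)
        return (0);     /* nothing to compare against yet */
    return (toy_cksum(tb->data, sizeof (tb->data)) == tb->cksum ? 0 : -1);
}

int
main(void)
{
    toy_buf_t tb = { { 0 }, 0, 0 };

    toy_buf_thaw(&tb);
    memset(tb.data, 0xab, sizeof (tb.data));    /* "write" into the buffer */
    toy_buf_freeze(&tb);
    assert(toy_buf_verify(&tb) == 0);

    tb.data[17] ^= 0x04;                        /* simulated memory corruption */
    assert(toy_buf_verify(&tb) == -1);
    return (0);
}

The real code is considerably more involved (and, as Darren notes, is not run on every access), but the state machine is the interesting part: only frozen buffers have a checksum worth checking.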
> On 31/03/2010 10:27, Erik Trimble wrote:
>> [...]
>
> A subset of this is already done. The ARC keeps its own in-memory
> checksum (because some buffers in the ARC are not yet on stable storage,
> so they don't have a block pointer checksum yet).
>
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c
>
> arc_buf_freeze()
> arc_buf_thaw()
> arc_cksum_verify()
> arc_cksum_compute()
>
> It isn't done on every access, but it can detect in-memory corruption -
> I've seen it happen on several occasions, though all due to errors in my
> code, not bad physical memory.
>
> Doing it more frequently could cause a significant performance problem.

Or there might be an extra zpool-level (or system-wide) property to enable checking checksums on every access from the ARC - there would be a significant performance impact, but it might be acceptable for really paranoid folks, especially with modern hardware.

-- 
Robert Milkowski
http://milek.blogspot.com
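A sketch of what Robert's opt-in knob might look like, with invented names throughout - no such property exists today, and a real version would presumably be a pool or dataset property rather than a global variable. The point is that the extra pass over the buffer is only paid when the knob is on; with it off, a flipped bit in the cached copy goes unnoticed:

/*
 * Invented names -- just sketching an opt-in "verify the cached checksum
 * on every access" tunable.  Not ZFS code.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int toy_verify_on_access = 0;    /* would be a pool/system property */

typedef struct toy_cached {
    uint8_t     data[512];
    uint64_t    cksum;
} toy_cached_t;

static uint64_t
toy_cksum(const uint8_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) {
        a += p[i];
        b += a;
    }
    return ((b << 32) ^ a);
}

static int
toy_cache_read(const toy_cached_t *tc, uint8_t *dst, size_t n)
{
    /* Only the paranoid pay the extra pass over the buffer. */
    if (toy_verify_on_access &&
        toy_cksum(tc->data, sizeof (tc->data)) != tc->cksum)
        return (-1);        /* treat like a checksum error */
    memcpy(dst, tc->data, n);
    return (0);
}

int
main(void)
{
    toy_cached_t tc;
    uint8_t out[512];

    memset(tc.data, 0x5a, sizeof (tc.data));
    tc.cksum = toy_cksum(tc.data, sizeof (tc.data));

    tc.data[3] ^= 0x10;     /* silent in-memory corruption */
    printf("default:  %d\n", toy_cache_read(&tc, out, sizeof (out)));
    toy_verify_on_access = 1;
    printf("paranoid: %d\n", toy_cache_read(&tc, out, sizeof (out)));
    return (0);
}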
On Wed, 31 Mar 2010, Robert Milkowski wrote:
> or there might be an extra zpool-level (or system-wide) property to enable
> checking checksums on every access from the ARC - there would be a
> significant performance impact, but it might be acceptable for really
> paranoid folks, especially with modern hardware.

How would this checking take place for memory-mapped files?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 31/03/2010 16:44, Bob Friesenhahn wrote:
> On Wed, 31 Mar 2010, Robert Milkowski wrote:
>> or there might be an extra zpool-level (or system-wide) property to
>> enable checking checksums on every access from the ARC - there would be
>> a significant performance impact, but it might be acceptable for really
>> paranoid folks, especially with modern hardware.
>
> How would this checking take place for memory-mapped files?

Well, and it wouldn't help if data were corrupted in an application's internal buffer after read() succeeded, or just before an application does a write().

So I wasn't saying that it would always work, or that it can work in all circumstances, but rather that it probably shouldn't be dismissed on the performance argument alone, as for some use cases with modern hardware the performance might well still be acceptable while providing better protection and a stronger data-correctness guarantee.

But even then, while the mmap() issue is probably solvable, the read() and write() cases probably are not.

-- 
Robert Milkowski
http://milek.blogspot.com
On 2010/03/31 05:13, Darren J Moffat wrote:
> On 31/03/2010 10:27, Erik Trimble wrote:
>> [...]
>
> A subset of this is already done. The ARC keeps its own in-memory
> checksum (because some buffers in the ARC are not yet on stable storage,
> so they don't have a block pointer checksum yet).
>
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c
>
> arc_buf_freeze()
> arc_buf_thaw()
> arc_cksum_verify()
> arc_cksum_compute()
>
> It isn't done on every access, but it can detect in-memory corruption -
> I've seen it happen on several occasions, though all due to errors in my
> code, not bad physical memory.
>
> Doing it more frequently could cause a significant performance problem.

Agreed. I think it's probably not a very good idea to check it everywhere. It would be great if we could do some checks occasionally, especially for critical data structures, but if it's the memory we cannot trust, how can we trust the checksum checker to behave correctly?

I had some questions about the FAST paper mentioned by Erik which were not answered during the conference, which makes me feel that the paper, while it pointed out some interesting issues, failed to prove that this is a real-world problem:

- How likely is a bit flip on a non-ECC system? Say, how many bits would be flipped per terabyte processed, or per transaction, or something similar?

- Among these flipped bits, how many would land in a file system buffer? What happens when, say, the application's own memory hits a flipped bit while the file system's buffers are fine?

- How much of a performance penalty would there be if we checked the checksums every time the data is accessed? How good would that check be compared to ECC in terms of correctness?

Cheers,
-- 
Xin LI <delphij at delphij.net>    http://www.delphij.net/
FreeBSD - The Power to Serve!    Live free or die
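Not an answer to the first question, but to make its back-of-envelope shape concrete: DRAM soft-error rates are usually quoted per capacity and time (FIT per Mbit, i.e. errors per 10^9 device-hours) rather than per terabyte processed, and the conversion is simple arithmetic. The rate used below is a placeholder assumption, not a measurement - published studies disagree by orders of magnitude, which is really the heart of the question:

/*
 * Back-of-envelope only.  ASSUMED_FIT_PER_MBIT is a placeholder, not a
 * measured number -- published DRAM error-rate studies differ by orders
 * of magnitude.
 */
#include <stdio.h>

int
main(void)
{
    const double ASSUMED_FIT_PER_MBIT = 1000.0; /* errors per 1e9 hours per Mbit (assumption!) */
    const double ram_gb = 8.0;                  /* machine under consideration */
    const double hours = 24.0;                  /* window of interest */

    double mbits = ram_gb * 1024.0 * 8.0;       /* GB -> Mbit */
    double expected_flips = ASSUMED_FIT_PER_MBIT * mbits * hours / 1e9;

    printf("assumed rate: %.0f FIT/Mbit, %.0f GB RAM, %.0f h\n",
        ASSUMED_FIT_PER_MBIT, ram_gb, hours);
    printf("expected bit flips in that window: %f\n", expected_flips);
    return (0);
}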
On Thu, Apr 01, 2010 at 12:38:29AM +0100, Robert Milkowski wrote:
> So I wasn't saying that it would always work, or that it can work in all
> circumstances, but rather that it probably shouldn't be dismissed on the
> performance argument alone, as for some use cases

It would be of great utility even if considered only as a diagnostic measure - i.e., for qualifying tests, or when something else raises suspicion and you want to eliminate or confirm sources of problems.

With a suitable pointer in a FAQ/troubleshooting guide, it could reduce the number and improve the quality of problem reports related to bad hardware.

--
Dan.