Experimenting with OpenSolaris on an elderly PC with equally elderly drives, zpool status shows errors after a pkg image-update followed by a scrub. It is entirely possible that one of these drives is flaky, but surely the whole point of a zfs mirror is to avoid this? It seems unlikely that both drives failed at the same time. Could someone explain how this can happen? Another question (perhaps for the indiana folks) is how to restore these files?

# zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h24m with 2 errors on Wed Apr 15 09:15:40 2009
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0    69
          mirror    ONLINE       0     0   144
            c3d0s0  ONLINE       0     0   145  128K repaired
            c3d1s0  ONLINE       0     0   151  168K repaired

errors: Permanent errors have been detected in the following files:

        //lib/amd64/libsec.so.1
        //lib/libdlpi.so.1
On Wed, 15 Apr 2009, Frank Middleton wrote:

> Experimenting with OpenSolaris on an elderly PC with equally
> elderly drives, zpool status shows errors after a pkg image-update
> followed by a scrub. It is entirely possible that one of these
> drives is flaky, but surely the whole point of a zfs mirror is
> to avoid this? It seems unlikely that both drives failed at the
> same time. Could someone explain how this can happen? Another
> question (perhaps for the indiana folks) is how to restore these
> files?

If a corruption occurred in the main memory, the backplane, or the disk controller during the writes to these files, then the original data written could be corrupted, even though you are using mirrors. If the system experienced a physical shock, or power supply glitch, while the data was written, then it could impact both drives.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 04/15/09 14:30, Bob Friesenhahn wrote:
> On Wed, 15 Apr 2009, Frank Middleton wrote:
>> zpool status shows errors after a pkg image-update
>> followed by a scrub.
>
> If a corruption occurred in the main memory, the backplane, or the disk
> controller during the writes to these files, then the original data
> written could be corrupted, even though you are using mirrors. If the
> system experienced a physical shock, or power supply glitch, while the
> data was written, then it could impact both drives.

Quite. Sounds like an architectural problem. This old machine probably doesn't have ECC memory (AFAIK still rare on most PCs), but it is on a serial UPS and isolated from shocks, and this has happened more than once. These drives on this machine recently passed both the purge and verify cycles (format/analyze) several times. Unless the data is written to both drives from the same buffer and checksum (surely not!), it is still unclear how it could get written to *both* drives with a bad checksum.

It looks like the files really are bad - neither of them can be read - unless ZFS sensibly refuses to allow possibly good files with bad checksums to be read (cannot read: I/O error). BTW, fmdump -ev doesn't seem to report any disk errors at all.

So my question remains - even with the grottiest hardware, how can several files get written with bad checksums to mirrored drives? ZFS has so many cool features that this would be easy to live with if there were a reasonably simple way to get copies of these files to restore them, short of getting the source and recompiling, or pkg uninstall followed by install (if you can figure out which pkg(s) the bad files are in), but it seems to defeat the purpose of software mirroring...
On 15-Apr-09, at 8:31 PM, Frank Middleton wrote:

> Quite. Sounds like an architectural problem. This old machine probably
> doesn't have ECC memory (AFAIK still rare on most PCs), but it is on
> a serial UPS and isolated from shocks, and this has happened more
> than once. These drives on this machine recently passed both the purge
> and verify cycles (format/analyze) several times. Unless the data is
> written to both drives from the same buffer and checksum (surely not!),

Doesn't seem that far-fetched...

> it is still unclear how it could get written to *both* drives with a
> bad checksum. It looks like the files really are bad - neither of
> them can be read - unless ZFS sensibly refuses to allow possibly good
> files with bad checksums to be read (cannot read: I/O error).
>
> BTW, fmdump -ev doesn't seem to report any disk errors at all.
>
> So my question remains - even with the grottiest hardware, how can
> several files get written with bad checksums to mirrored drives?

Bad RAM would seem a possible cause, wouldn't it?

--Toby
> Quite. Sounds like an architectural problem. This old machine probably
> doesn't have ECC memory (AFAIK still rare on most PCs), but it is on
> a serial UPS and isolated from shocks, and this has happened more
> than once. These drives on this machine recently passed both the purge
> and verify cycles (format/analyze) several times. Unless the data is
> written to both drives from the same buffer and checksum (surely not!),

You really believe that the copy was copied and checksummed twice before writing to the disk? Of course not. Copying the data doesn't help; both pieces of memory need to be good. It's checksummed once. The checksum fails to verify, so one of the following happened:

 - the memory was corrupted after the checksum was computed
 - the data was damaged en route to the disk
 - the data was damaged on disk
 - the data was damaged on the way back from the disk
 - the data was damaged in memory

> it is still unclear how it could get written to *both* drives with a
> bad checksum. It looks like the files really are bad - neither of
> them can be read - unless ZFS sensibly refuses to allow possibly good
> files with bad checksums to be read (cannot read: I/O error).

That can happen when the memory is corrupted before the first write to disk.

> So my question remains - even with the grottiest hardware, how can
> several files get written with bad checksums to mirrored drives?

Bad memory.

Casper
Frank Middleton wrote:
> Experimenting with OpenSolaris on an elderly PC with equally
> elderly drives, zpool status shows errors after a pkg image-update
> followed by a scrub. It is entirely possible that one of these
> drives is flaky, but surely the whole point of a zfs mirror is
> to avoid this? It seems unlikely that both drives failed at the
> same time.

Possible causes:
 + bad CPU
 + bad memory, or memory which does not self-correct transient errors
 + faulty cabling
 + electrically noisy environment
 + multiply all of the above by 3 or more (main CPU, HBA, disk logic)

Alas, today we don't know if bad data was written to both sides of the mirror. Rather, ZFS does not report whether both sides of the mirror agree when the checksum fails. I filed an RFE on this, because it would help diagnose where the corruption occurred: memory/data path vs. medium.

> Could someone explain how this can happen? Another
> question (perhaps for the indiana folks) is how to restore these
> files?

These files look like they would have been delivered in an OS. If so, just copy a good version over the bad.
 -- richard
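(By way of illustration only: on an IPS-based OpenSolaris image, the damaged files should be traceable to the package that delivers them and repairable from the repository. The package name below is just a guess; use whatever pkg search actually reports.)

  # pkg search -l libdlpi.so.1      (shows which installed package delivers the file)
  # pkg verify SUNWcsl              (checks the installed files against the manifest)
  # pkg fix SUNWcsl                 (re-fetches and repairs anything that fails verification)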
On 04/16/09 04:39, Casper.Dik at Sun.COM wrote:
> You really believe that the copy was copied and checksummed twice before
> writing to the disk? Of course not. Copying the data doesn't help;
> both pieces of memory need to be good. It's checksummed once.

If OpenSolaris succeeds in being significantly adopted as a desktop O/S, it is going to be running on some pretty grotty hardware: no ECC memory, cheap PCI controllers, etc. Clearly this computer has hardware problems; my guess is the PCI itself, although it seems to run Linux and OpenSolaris quite happily. If the memory were so bad that two separately computed checksums failed, then I doubt it would run anything reliably. FWIW it passes every diagnostic I've run, but that doesn't prove anything...

ZFS can't catch the case where the data is bad before it is checksummed, so we can ignore that one for this discussion. This scenario seems to have bad checksums or bad data (or both) being written to both disks. So why not copy and store the data + checksum twice? In the grand scheme of things, it is hard to believe that this would add significant overhead (it might even speed things up if both disks can be written in parallel?), and it would help in diagnosing what is a novel problem.

Let CSA and CSB be the stored checksums, and CRA and CRB be the recomputed checksums after the data is read back from each half of the mirror. Presumably a scrub always reads both sides of a mirror, so all permutations are possible. One interesting case is where CSA == CSB and CRA == CRB but CSA != CRA, vs. the case where all four checksums are different. It seems improbable that two disks would fail in the same way at the same moment, so the first scenario would point at some other source of error. It would be helpful to know which scenario is happening.

Good old reliable Sun products with an ECC bus and memory simply don't have this kind of problem. The hardware detects it long before it becomes a software issue. Not so with el-cheapo PCs, whose owners will likely be frustrated (see the "[zfs-discuss] How recoverable is an 'unrecoverable error'?" thread) when their previously seemingly reliable disks start to apparently fail in mysterious ways.

I'd like to submit an RFE suggesting that data + checksum be copied for mirrored writes, but I won't waste anyone's time doing so unless you think there is a point. One might argue that a machine this flaky should be retired, but it is actually working quite well, and perhaps represents not even the extreme of bad hardware that ZFS might encounter.

Cheers -- Frank
> I'd like to submit an RFE suggesting that data + checksum be copied for
> mirrored writes, but I won't waste anyone's time doing so unless you
> think there is a point. One might argue that a machine this flaky should
> be retired, but it is actually working quite well, and perhaps represents
> not even the extreme of bad hardware that ZFS might encounter.

I think it's a stupid idea. If you get two checksums, what can you do? The second copy is most likely suspect, and you double your chance of using bad memory.

Casper
On 17-Apr-09, at 11:49 AM, Frank Middleton wrote:

> ... One might argue that a machine this flaky should
> be retired, but it is actually working quite well,

If it has bad memory, you won't get much useful work done on it until the memory is replaced - unless you want to risk your data with random failures, and potentially waste large amounts of time. You should do a comprehensive memory test ASAP and replace what's not working.

ZFS's job isn't to test your memory, so I think the proposed patch is pointless. It also doesn't address the case where the application buffer is corrupt.

--T
On 04/17/09 12:37, Casper.Dik at Sun.COM wrote:
>> I'd like to submit an RFE suggesting that data + checksum be copied for
>> mirrored writes, but I won't waste anyone's time doing so unless you
>> think there is a point.
>
> I think it's a stupid idea. If you get two checksums, what can you do?
> The second copy is most likely suspect, and you double your chance of
> using bad memory.

If there were permanently bad memory locations, surely the diagnostics would reveal them. Here's an interesting paper on memory errors:

http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf

Given the inevitability of relatively frequent transient memory errors, I would think it behooves the file system to minimize the effects of such errors. But I won't belabor the point except to suggest that the cost of adding the suggested step would not be very expensive (either to implement or run).

Memory diagnostics ran for a full 12 hours with no errors. Same goes for both disks, using Solaris format/analyze/verify. So far, after creating 400,000 files, two files had permanent, apparently truly unrecoverable errors and could not be read by anything.

Now it gets really funky. I detached one of the disks, and then found it couldn't be reattached. It turns out there is a rounding problem with Solaris fdisk (run from format) that can cause identical partitions on identical disks to have different sizes. I used the Linux sfdisk utility to repair the MBR and fix the Solaris2 partition sizes. Then it was possible to reattach the disk. Unfortunately it wasn't possible to boot from the result, but a reinstall went perfectly with no ZFS errors being reported at all. So it appears that the problem may be with the OpenSolaris fdisk. Is this worth reporting as a bug? It is likely to be quite hard to reproduce...
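(A sketch of the kind of comparison involved before re-attaching, using the device names from the original post; exact fdisk option letters may vary by release.)

  # prtvtoc /dev/rdsk/c3d0s0          (slice sizes on the half still in the pool)
  # prtvtoc /dev/rdsk/c3d1s0          (slice sizes on the disk being re-attached)
  # fdisk -W - /dev/rdsk/c3d1p0       (dump the fdisk table to compare the Solaris2 partition sizes)
  # zpool attach rpool c3d0s0 c3d1s0  (re-attach once the sizes agree)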
> If there were permanently bad memory locations, surely the diagnostics
> would reveal them. Here's an interesting paper on memory errors:
> http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf
> Given the inevitability of relatively frequent transient memory
> errors, I would think it behooves the file system to minimize the
> effects of such errors. But I won't belabor the point except to
> suggest that the cost of adding the suggested step would not be
> very expensive (either to implement or run).

I'm still not clear what you win. You copy the data (which isn't actually that cheap, especially when running a load which uses a lot of memory bandwidth). And now what? You can't write two different checksums; I mean, we're mirroring the data so it MUST BE THE SAME. (A different checksum would be wrong: I don't think ZFS will allow different checksums for different sides of a mirror.)

You are assuming that the error is the memory being modified after computing the checksum; I would say that that is unlikely; I think it's a bit more likely that the data gets corrupted when it's handled by the disk controller or the disk itself. (The data is continuously re-written by the DRAM controller.)

> Memory diagnostics ran for a full 12 hours with no errors. Same goes
> for both disks, using Solaris format/analyze/verify. So far, after
> creating 400,000 files, two files had permanent, apparently truly
> unrecoverable errors and could not be read by anything.

It would have been nice if we were able to recover the contents of the files; if you also know what was supposed to be there, you can diff and then we can find out what was wrong.

> Now it gets really funky. I detached one of the disks, and then found
> it couldn't be reattached. It turns out there is a rounding problem with
> Solaris fdisk (run from format) that can cause identical partitions on
> identical disks to have different sizes.

There might be some skeletons buried in the IDE device drivers; I once had a disk which broke (well, one sector or more was broken), so I added "bad sectors" in format. But the disk still seemed to be bad, even after running the "check disk" tool from Western Digital. The disk would hang when I read certain bits. Then I copied the disk to an identical disk; it hung in the same way. Then I "zapped" the copy, relabeled it, copied the data per slice (not the whole disk, but slice by slice), and then the new disk worked. So while the first disk was broken (the Western Digital tool moved some sectors somewhere else), adding "bad sectors" in Solaris broke "something else".

Casper
There have been a number of threads here on the reliability of ZFS in the face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC) hardware, but isn't it reasonable to expect it to run well on something less well engineered? I am a real ZFS fan, and I'd hate to see folks trash it because it appears to be unreliable.

In an attempt to bolster the proposition that there should at least be an option to buffer the data before checksumming and writing, we've been doing a lot of testing on presumed flaky (cheap) hardware, with a peculiar result - see below.

On 04/21/09 12:16, Casper.Dik at Sun.COM wrote:
> And now what? You can't write two different checksums; I mean, we're
> mirroring the data so it MUST BE THE SAME. (A different checksum would be
> wrong: I don't think ZFS will allow different checksums for different
> sides of a mirror.)

Unless it does a read after write on each disk, how would it know that the checksums are the same? If the data is damaged before the checksum is calculated, then it is no worse than the UFS/ext3 case. If data + checksum is damaged while the (single) checksum is being calculated, or after, then the file is already lost before it is even written! There is a significant probability that this could occur on a machine with no ECC. Evidently memory concerns /are/ an issue - this thread http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests including a memory diagnostic with the distribution CD (Fedora already does so).

Memory diagnostics just test memory. Disk diagnostics just test disks. ZFS keeps disks pretty busy, so perhaps it loads the power supply to the point where it heats up and memory glitches are more likely. That might also explain why errors don't really begin until ~15 minutes after the busy time starts.

You might argue that this problem could only affect systems doing a lot of disk I/O, and such systems probably have ECC memory. But doing an O/S install is the one time when a consumer-grade computer does a *lot* of disk I/O for quite a long time and is hence vulnerable. Ironically, the OpenSolaris installer does not allow for ZFS mirroring at install time, one time when it might be really important! Now that sounds like a more useful RFE, especially since it would be relatively easy to implement. Anaconda does it...

A Solaris install writes almost 4*10^10 bits. Quoting Wikipedia, look at Cypress on ECC, see http://www.edn.com/article/CA454636.html. Possibly, statistically likely random memory glitches could actually explain the error rate that is occurring.

> You are assuming that the error is the memory being modified after
> computing the checksums; I would say that that is unlikely; I think it's a
> bit more likely that the data gets corrupted when it's handled by the disk
> controller or the disk itself. (The data is continuously re-written by
> the DRAM controller.)

See below for an example where a checksum error occurs without the disk subsystem being involved. There seems to be no plausible explanation other than an improbable bug in X86 ZFS itself.

> It would have been nice if we were able to recover the contents of the
> file; if you also know what was supposed to be there, you can diff and
> then we can find out what was wrong.

"file" on those files resulted in "bus error". Is there a way to actually read a file reported by ZFS as unrecoverable to do just that (and to separately retrieve the copy from each half of the mirror)?
Maybe this should be a new thread, but I suspect the following proves that the problem must be memory, and that begs the question of how memory glitches can cause fatal ZFS checksum errors. Here is the peculiar result (same machine):

After several attempts, I succeeded in doing a zfs send to a file on an NFS-mounted ZFS file system on another machine (SPARC), followed by a zfs recv of that file there. But every attempt to do a zfs recv of the same snapshot (i.e., from NFS) on the local machine (X86) has failed with a checksum mismatch. Obviously the file is good, since it was possible to do a zfs recv from it. You can't blame the IDE drivers (or the bus, or the disks) for this. Similarly, piping the snapshot through SSH fails, so you can't blame NFS either. Something is happening to cause checksum failures between the time the data is received by the PC and the time ZFS verifies its checksums. Surely this is either a highly repeatable memory glitch, or (most unlikely) a bug in X86 ZFS. A zfs recv to another SPARC over SSH, to the same physical disk (accessed via a SATA/PATA adapter), was also successful.

Does this prove that the data + checksum is being corrupted by memory glitches? Both NFS and SSH over TCP/IP provide reliable transport (via checksums), so the data is presumably received correctly. ZFS then calculates its own checksum and it fails. Oddly, it /always/ fails, but not at the same point, and far into the stream when both disks have been very busy for a while.

It would be interesting to see if the checksumming still fails if the writes were somehow skipped or sent to /dev/null. If it still fails, it should be possible to pinpoint the failure. If not, then it would seem that the only recourse is to replace the machine or not use ZFS, even though it is otherwise quite reliable (it has been running an XDMCP session for 2 weeks now with no apparent glitches; even zpool status shows no errors at all after a couple of scrubs). It would be even more interesting to hear speculation as to why another machine can recv the datastream but not the one that originated it.

If memory that can pass diagnostics for 24 hours at a stretch can cause glitches in huge datastreams, then IMO it behooves ZFS to defend itself against them. Buffering disk I/O on machines with no ECC seems like reasonably cheap insurance against a whole class of errors, and could make ZFS usable on PCs that, although they work fine with ext3, fail annoyingly with ZFS. Ironically this wouldn't fix the peculiar recv problem, which nonetheless seems to point to memory glitches as a source of errors.

-- Frank
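(For readers trying to follow the test sequence described above, it amounts to something like the following; host names, dataset names, and paths are placeholders, and the exact error text from zfs recv will vary.)

  x86#   zfs snapshot rpool/export@test
  x86#   zfs send rpool/export@test > /net/sparchost/tank/dump/test.zfs    (stream written over NFS)
  sparc# zfs recv tank/restored < /tank/dump/test.zfs                      (succeeds on the SPARC box)
  x86#   zfs recv rpool/restored < /net/sparchost/tank/dump/test.zfs       (fails with a checksum mismatch)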
On 22-May-09, at 5:24 PM, Frank Middleton wrote:

> Unless it does a read after write on each disk, how would it know that
> the checksums are the same? If the data is damaged before the checksum
> is calculated, then it is no worse than the UFS/ext3 case. If data +
> checksum is damaged while the (single) checksum is being calculated,
> or after, then the file is already lost before it is even written!
> There is a significant probability that this could occur on a machine
> with no ECC. Evidently memory concerns /are/ an issue

Yes, the important thing is to *detect* them; no system can run reliably with bad memory, and that includes any system with ZFS. Doing nutty things like calculating the checksum twice does not buy anything of value here. If the memory is this bad then applications will be dying all over the place, compilers will be segfaulting, and databases will be writing bad data even before it reaches ZFS.

> - this thread
> http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
> including a memory diagnostic with the distribution CD (Fedora already
> does so).

Absolutely, memory diags are essential. And you certainly run them if you see unexpected behaviour that has no other obvious cause.

> Memory diagnostics just test memory. Disk diagnostics just test disks.
> ZFS keeps disks pretty busy, so perhaps it loads the power supply
> to the point where it heats up and memory glitches are more likely.

Your logic is rather tortuous. If the hardware is that crappy then there's not much ZFS can do about it.
> Maybe this should be a new thread, but I suspect the following
> proves that the problem must be memory, and that begs the question
> of how memory glitches can cause fatal ZFS checksum errors.

Of course they can; but they will also break anything else on the machine.

...

> If memory that can pass diagnostics for 24 hours at a
> stretch can cause glitches in huge datastreams, then IMO it
> behooves ZFS to defend itself against them. Buffering disk
> I/O on machines with no ECC seems like reasonably cheap
> insurance against a whole class of errors, and could make
> ZFS usable on PCs that, although they work fine with ext3,
> fail annoyingly with ZFS.

How can a machine with bad memory "work fine with ext3"?

--Toby
>> If memory that can pass diagnostics for 24 hours at a
>> stretch can cause glitches in huge datastreams, then IMO it
>> behooves ZFS to defend itself against them. Buffering disk
>> I/O on machines with no ECC seems like reasonably cheap
>> insurance against a whole class of errors, and could make
>> ZFS usable on PCs that, although they work fine with ext3,
>
> How can a machine with bad memory "work fine with ext3"?

"It appears to work."

A long time ago I bought a new PC; it ran Windows, it installed Solaris (pre-ZFS), but when I tried to build on-net, something would die because of a SIGBUS or a SIGSEGV. When I finally ran memtest86 (which did require a BIOS that supported a USB keyboard properly), I found one broken 512MB DIMM and replaced it.

Similarly, when someone upgraded and started to use ZFS he continuously got bad checksums; and in the end it turned out the power supply was broken (not a "bad brand" but one which was actually broken, delivering out-of-spec voltages).

Casper
Casper.Dik at Sun.COM wrote:
>>> If memory that can pass diagnostics for 24 hours at a
>>> stretch can cause glitches in huge datastreams, then IMO it
>>> behooves ZFS to defend itself against them.
>>
>> How can a machine with bad memory "work fine with ext3"?
>
> "It appears to work."
...
> When I finally ran memtest86 (which did require a BIOS that supported a USB
> keyboard properly), I found one broken 512MB DIMM and replaced it.

Another important fact is that Linux starts using physical memory from low addresses, while Solaris takes care about the CPU cache and in addition allocates memory for DMA from the top physical pages. A PC with bad memory in the top parts may appear OK with another OS, but this is a fallacious impression.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                  (uni)
       joerg.schilling at fokus.fraunhofer.de  (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
<preface>
This forum is littered with claims of "zfs checksums are broken" where the root cause turned out to be faulty hardware or firmware in the data path.
</preface>

I think that before speculating on a redesign, we should get to the root cause.

Frank Middleton wrote:
> There have been a number of threads here on the reliability of ZFS in the
> face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC)
> hardware, but isn't it reasonable to expect it to run well on something
> less well engineered? I am a real ZFS fan, and I'd hate to see folks
> trash it because it appears to be unreliable.

It depends on what you consider to be flaky. If a CPU has a stuck bit in the carry lookahead (can't add properly for some pattern of operands), then it is flaky and will probably create bogus checksums, no?

> Unless it does a read after write on each disk, how would it know that
> the checksums are the same? If the data is damaged before the checksum
> is calculated, then it is no worse than the UFS/ext3 case.

Even if you do a read after write, there is no guarantee that you will read from the medium instead of a cache. There is some concern here, in general, because some mobo RAID controllers and (I believe) some disk drives have caches which are not protected. These are generally not too much of a problem because the data is not resident for a significant period of time, and the probability of a bit flip caused by radiation, for instance, is a function of time.

> If data + checksum is damaged while the (single) checksum is being
> calculated, or after, then the file is already lost before it is even
> written!

The checksum occurs in the pipeline prior to the write to disk. So if the data is damaged prior to the checksum, then ZFS will never know. Nor will UFS. Neither will be able to detect this. In Solaris, if the damage is greater than the ability of the memory system and CPU to detect or correct, then even Solaris won't know. If the memory system or CPU detects a problem, then Solaris fault management will kick in and do something, preempting ZFS.

> There is a significant probability that this could occur on a machine
> with no ECC. Evidently memory concerns /are/ an issue - this thread
> http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
> including a memory diagnostic with the distribution CD (Fedora already
> does so).

SunVTS ships with SXCE and Solaris 2.2-10. SunVTS replaced SunDiag which, IIRC, started shipping in SunOS 3. I believe SunVTS is available via the OpenSolaris repository for those with support contracts. VTS is an acronym for Verification Test Suite and includes many tests, including memory tests. VTS is used to verify systems in the factory prior to shipping to customers. Look for /usr/sunvts on your system, or search for the SUNWvts* packages and check out the docs online.

> Memory diagnostics just test memory. Disk diagnostics just test disks.

This is not completely accurate.
Disk diagnostics also test the data path. Memory tests also test the CPU. The difference is the amount of test coverage for the subsystem.

> ZFS keeps disks pretty busy, so perhaps it loads the power supply
> to the point where it heats up and memory glitches are more likely.

In general, for like configurations, ZFS won't keep a disk any more busy than other file systems. In fact, because ZFS groups transactions, it may create less activity than other file systems, such as UFS.

> Ironically, the OpenSolaris installer does not allow for ZFS
> mirroring at install time, one time when it might be really important!
> Now that sounds like a more useful RFE, especially since it would be
> relatively easy to implement. Anaconda does it...

This is not an accurate statement. The OpenSolaris installer does support mirrored boot disks via the Automated Installer method. See http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html. You can also install Solaris 10 to mirrored root pools via JumpStart.

> See below for an example where a checksum error occurs without the
> disk subsystem being involved. There seems to be no plausible
> explanation other than an improbable bug in X86 ZFS itself.

I think a better test would be to md5 the file from all systems and see if the md5 hashes are the same. If they are, then yes, the finger would point more in the direction of ZFS. The send/recv protocol hasn't changed in quite some time, but it is arguably not as robust as it could be.

ZFS send/recv uses fletcher4 for the checksums. ZFS uses fletcher2 for data (by default) and fletcher4 for metadata. The same fletcher code is used. So if you believe fletcher4 is broken for send/recv, how do you explain that it works for the metadata? Or does it? There may be another failure mode at work here... (see comment on scrubs at the end of this extended post)

> "file" on those files resulted in "bus error". Is there a way to actually
> read a file reported by ZFS as unrecoverable to do just that (and to
> separately retrieve the copy from each half of the mirror)?

ZFS corrects automatically, when it can. But if the source data is bad, then ZFS couldn't possibly detect it.
For files that ZFS can detect are corrupted and cannot automatically correct, you can get the list from "zpool status -xv". The behaviour as seen by applications is determined by the zpool failmode property.

In any event, if file core dumps consistently in the same part of the code, then please log a bug against file -- it should not core dump, no matter what input it receives.

> Does this prove that the data + checksum is being corrupted by
> memory glitches? Both NFS and SSH over TCP/IP provide reliable
> transport (via checksums), so the data is presumably received correctly.
> ZFS then calculates its own checksum and it fails. Oddly, it /always/
> fails, but not at the same point, and far into the stream when both
> disks have been very busy for a while.

Uhmm, if it were a software bug, one would expect it to fail at exactly the same place, no?

> It would be interesting to see if the checksumming still fails
> if the writes were somehow skipped or sent to /dev/null. If it
> still fails, it should be possible to pinpoint the failure. If
> not, then it would seem that the only recourse is to replace
> the machine or not use ZFS, even though it is otherwise quite
> reliable (it has been running an XDMCP session for 2 weeks
> now with no apparent glitches; even zpool status shows no
> errors at all after a couple of scrubs). It would be even
> more interesting to hear speculation as to why another machine
> can recv the datastream but not the one that originated it.

Yep, interesting question. But since you say "even zpool status shows no errors at all after a couple of scrubs", that makes me think you've had errors in the past?

> If memory that can pass diagnostics for 24 hours at a
> stretch can cause glitches in huge datastreams, then IMO it
> behooves ZFS to defend itself against them. Buffering disk
> I/O on machines with no ECC seems like reasonably cheap
> insurance against a whole class of errors, and could make
> ZFS usable on PCs that, although they work fine with ext3,
> fail annoyingly with ZFS. Ironically this wouldn't fix the
> peculiar recv problem, which nonetheless seems to point
> to memory glitches as a source of errors.

I'm still a little confused. If ext3 can't detect data errors, what verification have you used to back your claim that it is unaffected? Please check the image views with md5 digests and get back to us.
If you get a chance, run SunVTS to verify the memory and CPU, too. If the CPU is b0rken, the fletcher4 checksum for the recv may be tickling it.

<sidebar>
Microsoft got so tired of defending its software against memory errors that it requires Windows Server platforms to use ECC. But even Microsoft doesn't have the power to force the vendors to use ECC for all PCs.
</sidebar>
 -- richard
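(The md5 comparison asked for above amounts to nothing more than hashing the same stream file from each host that can see it and comparing the digests; host names and paths below are placeholders.)

  sparc$ md5sum /tank/dump/test.zfs
  x86$   md5sum /net/sparchost/tank/dump/test.zfs

If the digests match, the stream reached the X86 box intact over NFS, and whatever corrupts it happens later, on the receiving host.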
On 05/22/09 21:08, Toby Thain wrote:
> Yes, the important thing is to *detect* them; no system can run reliably
> with bad memory, and that includes any system with ZFS. Doing nutty
> things like calculating the checksum twice does not buy anything of
> value here.

All memory is "bad" if it doesn't have ECC. There are only varying degrees of badness. Calculating the checksum twice on its own would be nutty, as you say, but doing so on a separate copy of the data might prevent unrecoverable errors after writes to mirrored drives. You can't detect memory errors if you don't have ECC. But you can try to mitigate them. Not doing so makes ZFS less reliable than the memory it is running on. The problem is that ZFS makes any file with a bad checksum inaccessible, even if one really doesn't care whether the data has been corrupted. A workaround might be a way to allow such files to be readable despite the bad checksum... In hindsight I probably should have merely reported the problem and left those with more knowledge to propose a solution. Oh well.

> If the memory is this bad then applications will be dying all over the
> place, compilers will be segfaulting, and databases will be writing bad
> data even before it reaches ZFS.

But it isn't. Applications aren't dying, compilers are not segfaulting (it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm is staying up for weeks at a time... And I wouldn't consider running a non-trivial database application on a machine without ECC.

> Absolutely, memory diags are essential. And you certainly run them if
> you see unexpected behaviour that has no other obvious cause.

Runs for days, as noted.

> Your logic is rather tortuous. If the hardware is that crappy then
> there's not much ZFS can do about it.

Well, it could. For example, it could make copies of the data before checksumming so that one memory hit doesn't result in an unrecoverable file on a mirrored drive. Either that or there's a bug in ZFS. I am more inclined to blame the memory, especially since the failure rate isn't much higher than the expected rate as reported elsewhere.

>> Maybe this should be a new thread, but I suspect the following
>> proves that the problem must be memory, and that begs the question
>> of how memory glitches can cause fatal ZFS checksum errors.
>
> Of course they can; but they will also break anything else on the machine.

But they don't. Checksum errors are reasonable, but not unrecoverable ones on mirrors.

> How can a machine with bad memory "work fine with ext3"?

It does. It works fine with ZFS too - just really annoying unrecoverable files every now and then on mirrored drives. This shouldn't happen even with lousy memory, and wouldn't (doesn't) with ECC. If there were a way to examine the files and their checksums, I would be surprised if they were different (if they were, it would almost certainly be the controller or the PCI bus itself causing the problem). But I speculate that it is predictable memory hits.

-- Frank
> All memory is "bad" if it doesn't have ECC. There are only varying
> degrees of badness. Calculating the checksum twice on its own would
> be nutty, as you say, but doing so on a separate copy of the data
> might prevent unrecoverable errors after writes to mirrored drives.
> You can't detect memory errors if you don't have ECC.

And where exactly do you get the second good copy of the data? If you copy the data you've just doubled your chance of using bad memory. The original copy can be good or bad; the second copy cannot be better than the first copy.

> But you can try to mitigate them. Not doing so makes ZFS less reliable
> than the memory it is running on. The problem is that ZFS makes any file
> with a bad checksum inaccessible, even if one really doesn't care
> whether the data has been corrupted. A workaround might be a way to allow
> such files to be readable despite the bad checksum...

You can disable the checksums if you don't care.

> But it isn't. Applications aren't dying, compilers are not segfaulting
> (it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm
> is staying up for weeks at a time... And I wouldn't consider running a
> non-trivial database application on a machine without ECC.

One broken bit may not have caused serious damage: "most things work".

>> Absolutely, memory diags are essential. And you certainly run them if
>> you see unexpected behaviour that has no other obvious cause.
>
> Runs for days, as noted.

Doesn't prove anything.

Casper
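(For the record, turning checksums off is a per-dataset property change; the dataset name here is just an example. It only affects blocks written after the change, so existing blocks keep whatever checksum they were written with.)

  # zfs set checksum=off rpool/export
  # zfs get checksum rpool/export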
On 05/23/09 10:21, Richard Elling wrote:
> <preface>
> This forum is littered with claims of "zfs checksums are broken" where
> the root cause turned out to be faulty hardware or firmware in the data
> path.
> </preface>
>
> I think that before speculating on a redesign, we should get to
> the root cause.

The hardware is clearly misbehaving. No argument. The question is - how far out of reasonable behavior is it? Redesign? I'm not sure I can conceive of an architecture that would make double buffering difficult to do. It is unclear how faulty hardware or firmware could be responsible for such a low error rate (<1 in 4*10^10). I am just asking whether an option for machines with no ECC and their inevitable memory errors is a reasonable thing to suggest in an RFE.

> The checksum occurs in the pipeline prior to the write to disk.
> So if the data is damaged prior to the checksum, then ZFS will
> never know. Nor will UFS. Neither will be able to detect this.
> In Solaris, if the damage is greater than the ability of the memory
> system and CPU to detect or correct, then even Solaris won't know.
> If the memory system or CPU detects a problem, then Solaris fault
> management will kick in and do something, preempting ZFS.

Exactly. My whole point. And without ECC there's no way of knowing. But if the data is damaged /after/ checksum but /before/ write, then you have a real problem...

>> Memory diagnostics just test memory. Disk diagnostics just test disks.
>
> This is not completely accurate. Disk diagnostics also test the
> data path. Memory tests also test the CPU. The difference is the
> amount of test coverage for the subsystem.

Quite. But the disk diagnostic doesn't really test memory beyond what it uses to run itself. Likewise it may not test the FPU, for example.

>> ZFS keeps disks pretty busy, so perhaps it loads the power supply
>> to the point where it heats up and memory glitches are more likely.
>
> In general, for like configurations, ZFS won't keep a disk any more
> busy than other file systems. In fact, because ZFS groups transactions,
> it may create less activity than other file systems, such as UFS.

That's a point in its favor, although not really relevant. If the disks are really busy they will load the PSU more, and that could drag the supply down, which in turn might make errors occur that otherwise wouldn't.

>> Ironically, the OpenSolaris installer does not allow for ZFS
>> mirroring at install time, one time when it might be really important!
>
> This is not an accurate statement. The OpenSolaris installer does
> support mirrored boot disks via the Automated Installer method.
> See http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html.
> You can also install Solaris 10 to mirrored root pools via JumpStart.

I was talking about the live CD here. I prefer to install via JumpStart, but AFAIK OpenSolaris (Indiana) isn't available as an installable DVD. But most consumers are going to be installing from the live CD, and they are the ones with the low-end hardware without ECC. There was recently a suggestion on another thread about an RFE to add mirroring as an install option.

> I think a better test would be to md5 the file from all systems
> and see if the md5 hashes are the same. If they are, then yes,
> the finger would point more in the direction of ZFS.
> The send/recv protocol hasn't changed in quite some time, but it
> is arguably not as robust as it could be.

Thanks! An md5 hash is exactly the kind of test I was looking for.

md5sum on SPARC: 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
md5sum on X86:   9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)

> ZFS send/recv uses fletcher4 for the checksums. ZFS uses fletcher2
> for data (by default) and fletcher4 for metadata. The same fletcher
> code is used. So if you believe fletcher4 is broken for send/recv,
> how do you explain that it works for the metadata? Or does it?
> There may be another failure mode at work here...
> (see comment on scrubs at the end of this extended post)

[Did you forget the scrubs comment?] I never said it was broken. I assume the same code is used for both SPARC and X86, and it works fine on SPARC. It would seem that this machine gets memory errors so often (even though it passes the Linux memory diagnostic) that it can never get to the end of a 4GB recv stream. Odd that it can do the md5sum, but as mentioned, perhaps doing the I/O puts more strain on the machine and stresses it to where more memory faults occur. I can't quite picture a software bug that would cause random failures on specific hardware, and I am happy to give ZFS the benefit of the doubt.

> ZFS corrects automatically, when it can. But if the source data is
> bad, then ZFS couldn't possibly detect it.
> For files that ZFS can detect are corrupted and cannot automatically
> correct, you can get the list from "zpool status -xv". The behaviour
> as seen by applications is determined by the zpool failmode property.

Exactly. And "file" on such a file will repeatably segfault. So will pkg fix (there is a bug reported for this). Fortunately rm doesn't segfault, or there would be no way to repair such files. Is there a way to actually get copies of files with bad checksums so they may be examined to see where the fault actually lies?

Quoting the ZFS admin guide: "The failmode property ... provides the failmode property for determining the behavior of a catastrophic pool failure due to a loss of device connectivity or the failure of all devices in the pool." Has this changed since the ZFS admin guide was last updated? If not, it doesn't seem relevant.

> In any event, if file core dumps consistently in the same part of the
> code, then please log a bug against file -- it should not core dump,
> no matter what input it receives.

Ironically, all such files have long since been scrubbed away. I suppose one could deliberately damage a file to reproduce this. It could also be that a library required to /run/ file was the one that was damaged...

> Uhmm, if it were a software bug, one would expect it to fail
> at exactly the same place, no?

Exactly. Not a bug. If it were, it would have been fixed a long time ago on such a critical path. How about an RFE along the lines of "Improved support for machines without ECC memory"? How about one to recover files with bad checksums (a bit like getting fragments out of lost+found in the bad old days)?

> Yep, interesting question.
> But since you say "even zpool status shows no errors at all after a
> couple of scrubs", that makes me think you've had errors in the past?

You bet! 5 unrecoverable errors, and maybe 10 or so recoverable ones. About once a month, zpool status shows an error (note this machine is being used as an X terminal, so it hardly does any I/O) and a scrub gets rid of it.

> I'm still a little confused. If ext3 can't detect data errors, what
> verification have you used to back your claim that it is unaffected?

None at all. But in a read-mostly environment this isn't an issue. Other, known, bugs (in Fedora) account for almost every crash, and Solaris hasn't failed once since it was (finally) installed a few weeks ago with the screensaver disabled :-).

> Please check the image views with md5 digests and get back to us.
> If you get a chance, run SunVTS to verify the memory and CPU,
> too. If the CPU is b0rken, the fletcher4 checksum for the recv may
> be tickling it.

If the CPU were broken, wouldn't it always fail at the same point in the stream? It definitely doesn't. Could you expand a little on what it means to do md5sums on the image views? I'm not sure what an image view is in this context. AFAIK SUNWvts is available only in SXCE, not in OpenSolaris. Oddly, you can load SUNWvts via pkg, but evidently not smcwebserver - please correct me if I am wrong. FWIW we are running SXCE on SPARC (installed via JumpStart) and Indiana on X86 (installed via live CD and updated to snv_111a via pkg).

> <sidebar>
> Microsoft got so tired of defending its software against memory
> errors that it requires Windows Server platforms to use ECC. But
> even Microsoft doesn't have the power to force the vendors to use
> ECC for all PCs.
> </sidebar>

Quite. My point exactly! My only issue is that I have experienced what is IMO an unreasonably large number of unrecoverable errors on mirrored drives. I was merely speculating on reasons for this and possible solutions. Ironically, my applications are running beautifully, and the users are quite happy with the performance and stability. ZFS is wonderful because updates are so easy to roll back and painless to install, snapshots are so useful, and all the other reasons that make every other fs seem so antiquated...

In a sense, the proposal is merely to replicate in software what ECC does in hardware. There may be much better solutions than double buffering the data, and doing it at the level of ZFS is not a complete solution. But doing nothing exposes ZFS users of mirrored drives to the likelihood of unnecessarily unrecoverable failures due to statistically probable memory glitches on machines with no ECC.

Cheers -- Frank
On 05/26/09 03:23, Casper.Dik at Sun.COM wrote:

> And where exactly do you get the second good copy of the data?

From the first. And if it is already bad, as noted previously, this is no worse than the UFS/ext3 case. If you want total freedom from this class of errors, use ECC.

> If you copy the code you've just doubled your chance of using bad memory.
> The original copy can be good or bad; the second copy cannot be better
> than the first copy.

The whole point is that the memory isn't bad. About once a month, 4GB of memory of any quality can experience 1 bit being flipped, perhaps more or less often. If that bit happens to be in the checksummed buffer, then you'll get an unrecoverable error on a mirrored drive. And if I understand correctly, ZFS keeps data in memory for a lot longer than other file systems and uses more memory doing so. Good features, but they make it more vulnerable to random bit flips. This is why decent machines have ECC. To argue that ZFS should work reliably on machines without ECC flies in the face of statistical reality and the reason for ECC in the first place.

> You can disable the checksums if you don't care.

But I do care. I'd like to know if my files have been corrupted, or at least as much as possible. But there are huge classes of files for which the odd flipped bit doesn't matter and the loss of which would be very painful. Email archives and videos come to mind. An easy workaround is to simply store all important stuff on a machine with ECC. Problem solved...

> One broken bit may not have caused serious damage; "most things work".

Exactly.

>>> Absolutely, memory diags are essential. And you certainly run them if
>>> you see unexpected behaviour that has no other obvious cause.
>>
>> Runs for days, as noted.
>
> Doesn't prove anything.

Quite. But nonetheless, the unrecoverable errors did occur on mirrored drives, and that seems to defeat the whole purpose of mirroring, which is, AFAIK, keeping two independent copies of every file in case one gets lost. Writing both images from one buffer appears to violate the premise.

I can think of two RFEs:

1) Add an option to buffer writes on machines without ECC memory to avoid the possibility of random memory flips causing unrecoverable errors with mirrored drives.

2) An option to read files even if they have failed checksums.

1) could be fixed in the documentation - "ZFS should be used with caution on machines with no ECC since random bit flips can cause unrecoverable checksum failures on mirrored drives". Or "ZFS isn't supported on machines with memory that has no ECC".

Disabling checksums is one way of working around 2). But it also disables a cool feature. I suppose you could optionally change checksum failure from an error to a warning, but ideally it would be file by file...

Ironically, I wonder if this is even a problem with raidz? But grotty machines like these can't really support 3 or more internal drives...

Cheers -- Frank
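On point 2), there is at least a crude workaround today: dd can be told to keep going past the EIO that ZFS returns for a record whose checksum cannot be repaired, substituting zeros for the unreadable stretch. A rough sketch under obvious assumptions - the file name is a placeholder, the block size is only an example, and this recovers the readable records, not the damaged ones:

  # Copy what can still be read; failed records are replaced with zeros
  # (conv=noerror,sync) so that, with luck, the offsets of surviving data
  # are preserved in the output file.
  dd if=/path/to/damaged-file of=/var/tmp/salvaged.out \
     bs=128k conv=noerror,sync

  # Compare the salvaged copy against a known-good copy from another
  # system to see exactly which bytes were lost (paths are illustrative).
  cmp -l /var/tmp/salvaged.out /net/goodhost/path/to/good-copy | head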
On Tue, 26 May 2009, Frank Middleton wrote:

> Just asking if an option for machines with no ecc and their inevitable
> memory errors is a reasonable thing to suggest in an RFE.

Machines lacking ECC do not suffer from "inevitable memory errors". Memory errors are not like death and taxes.

> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

If memory does not work, then you do have a real problem. The ZFS ARC consumes a large amount of memory. Note that the problem of corruption around the time of the checksum/write is minor compared to corruption in the ZFS ARC, since data is continually read from the ZFS ARC and so bad data may be returned to the user even though it is (was?) fine on disk. This is as close as ZFS comes to having an Achilles' heel. Solving this problem would require crippling the system performance.

> Never said it was broken. I assume the same code is used for both SPARC
> and X86, and it works fine on SPARC. It would seem that this machine
> gets memory errors so often (even though it passes the Linux memory
> diagnostic) that it can never get to the end of a 4GB recv stream. Odd

Maybe you need a new computer, or need to fix your broken one.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:

> On Tue, 26 May 2009, Frank Middleton wrote:
>> Just asking if an option for machines with no ecc and their inevitable
>> memory errors is a reasonable thing to suggest in an RFE.
>
> Machines lacking ECC do not suffer from "inevitable memory errors".
> Memory errors are not like death and taxes.
>
>> Exactly. My whole point. And without ECC there's no way of knowing.
>> But if the data is damaged /after/ checksum but /before/ write, then
>> you have a real problem...
>
> If memory does not work, then you do have a real problem. The ZFS ARC
> consumes a large amount of memory. Note that the problem of corruption
> around the time of the checksum/write is minor compared to corruption in
> the ZFS ARC since data is continually read from the ZFS ARC and so bad
> data may be returned to the user even though it is (was?) fine on disk.
> This is as close as ZFS comes to having an Achilles' heel. Solving this
> problem would require crippling the system performance.

When running a DEBUG kernel (not something most people would do on a "production" system) ZFS does actually checksum and verify the buffers in the ARC - not on every access, but certain operations cause it to happen.

-- 
Darren J Moffat
On Tue, 26 May 2009, Frank Middleton wrote:

> 1) could be fixed in the documentation - "ZFS should be used with caution
> on machines with no ECC since random bit flips can cause unrecoverable
> checksum failures on mirrored drives". Or "ZFS isn't supported on
> machines with memory that has no ECC".

What problem are you looking to solve? Data is written by application software which includes none of the extra safeguards you are insisting should be in ZFS. This means that the data may be undetectably corrupted.

I strongly recommend that you purchase a system with ECC in order to operate reliably in the (apparent) radium mine where you live. It is time to wake up, smell the radon, and do something about the problem. Check this map to see if there is cause for concern in your area: http://upload.wikimedia.org/wikipedia/en/8/8b/US_homes_over_recommended_radon_levels.gif

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 25-May-09, at 11:16 PM, Frank Middleton wrote:

> On 05/22/09 21:08, Toby Thain wrote:
>> Yes, the important thing is to *detect* them; no system can run reliably
>> with bad memory, and that includes any system with ZFS. Doing nutty
>> things like calculating the checksum twice does not buy anything of
>> value here.
>
> All memory is "bad" if it doesn't have ECC. There are only varying
> degrees of badness. Calculating the checksum twice on its own would
> be nutty, as you say, but doing so on a separate copy of the data
> might prevent unrecoverable errors

I don't see this at all. The kernel reads the application buffer. How does reading it twice buy you anything?? It sounds like you are assuming 1) the buffer includes faulty RAM; and 2) the faulty RAM reads differently each time. Doesn't that seem statistically unlikely to you? And even if you really are chasing this improbable scenario, why make ZFS do the job of a memory tester?

> after writes to mirrored drives.
> You can't detect memory errors if you don't have ECC. But you can
> try to mitigate them. Without doing so makes ZFS less reliable than
> the memory it is running on. The problem is that ZFS makes any file
> with a bad checksum inaccessible, even if one really doesn't care
> if the data has been corrupted. A workaround might be a way to allow
> such files to be readable despite the bad checksum...

I am not sure what you are trying to say here.

> ...
>
>> How can a machine with bad memory "work fine with ext3"?
>
> It does. It works fine with ZFS too. Just really annoying unrecoverable
> files every now and then on mirrored drives. This shouldn't happen even
> with lousy memory and wouldn't (doesn't) with ECC. If there was a way
> to examine the files and their checksums, I would be surprised if they
> were different (if they were, it would almost certainly be the controller
> or the PCI bus itself causing the problem). But I speculate that it is
> predictable memory hits.

You're making this harder than it really is. Run a memory test. If it fails, take the machine out of service until it's fixed. There's no reasonable way to keep running faulty hardware.

--Toby

> -- Frank
On 26-May-09, at 10:21 AM, Frank Middleton wrote:

> On 05/26/09 03:23, Casper.Dik at Sun.COM wrote:
>
>> And where exactly do you get the second good copy of the data?
>
> From the first. And if it is already bad, as noted previously, this
> is no worse than the UFS/ext3 case. If you want total freedom from
> this class of errors, use ECC.
>
>> If you copy the code you've just doubled your chance of using bad memory.
>> The original copy can be good or bad; the second copy cannot be better
>> than the first copy.
>
> The whole point is that the memory isn't bad. About once a month, 4GB
> of memory of any quality can experience 1 bit being flipped, perhaps
> more or less often.

What you are proposing does practically nothing to mitigate "random bit flips". Think about the probabilities involved. You're testing one tiny buffer, very occasionally, for an extremely improbable event. It is also nothing to do with ZFS, and leaves every other byte of your RAM untested. See the reasoning?

--Toby

> ...
>
> Cheers -- Frank
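To put rough numbers on "one tiny buffer": the figures below are only illustrative (a single 128 KB record buffer out of 4 GB of RAM), but they show how small that window is compared with the rest of memory:

  # Fraction of 4 GB of RAM occupied by one 128 KB record buffer
  echo 'scale=8; (128*1024) / (4*1024^3)' | bc -l
  # .00003051  (about 0.003%)

  # So even if a bit does flip somewhere during the month, the chance that
  # it lands in that particular buffer while it is in flight is tiny.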
Frank brings up some interesting ideas, some of which might need some additional thought...

Frank Middleton wrote:

> On 05/23/09 10:21, Richard Elling wrote:
>> <preface>
>> This forum is littered with claims of "zfs checksums are broken" where
>> the root cause turned out to be faulty hardware or firmware in the data
>> path.
>> </preface>
>>
>> I think that before you should speculate on a redesign, we should get to
>> the root cause.
>
> The hardware is clearly misbehaving. No argument. The question is - how
> far out of reasonable behavior is it?

Hardware is much less expensive than software, even free software. Your system has a negative ROI, kinda like trading credit default swaps. The best thing you can do is junk it :-)

> Redesign? I'm not sure I can conceive an architecture that would make
> double buffering difficult to do. It is unclear how faulty hardware or
> firmware could be responsible for such a low error rate (<1 in 4*10^10).
> Just asking if an option for machines with no ecc and their inevitable
> memory errors is a reasonable thing to suggest in an RFE.

It is a good RFE, but it isn't an RFE for the software folks.

>> The checksum occurs in the pipeline prior to write to disk.
>> So if the data is damaged prior to checksum, then ZFS will
>> never know. Nor will UFS. Neither will be able to detect
>> this. In Solaris, if the damage is greater than the ability
>> of the memory system and CPU to detect or correct, then
>> even Solaris won't know. If the memory system or CPU
>> detects a problem, then Solaris fault management will kick
>> in and do something, preempting ZFS.
>
> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

To put this in perspective, ECC is a broad category. When we think of ECC for memory, it is usually Single Error (bit) Correction, Double Error (bit) Detection (SECDED). A well designed system will also do Single Device Data Correction (aka Chipkill or Extended ECC, since Chipkill is trademarked). What this means is that faults of more than 2 bits per word are not detected, unless all of the faults occur in the same chip for SDDC cases. Clearly, this wouldn't scale well to large data streams, which is why they use checksums like Fletcher or hash functions like SHA-256.

>>> ZFS keeps disks pretty busy, so perhaps it loads the power supply
>>> to the point where it heats up and memory glitches are more likely.
>>
>> In general, for like configurations, ZFS won't keep a disk any more
>> busy than other file systems. In fact, because ZFS groups transactions,
>> it may create less activity than other file systems, such as UFS.
>
> That's a point in its favor, although not really relevant. If the disks
> are really busy they will load the PSU more, and that could drag the
> supply down, which in turn might make errors occur that otherwise wouldn't.

The dynamic loads of modern disk drives are not very great. I don't believe your argument is very strong here. Also, the solution is, once again, fix the hardware.

>> I think a better test would be to md5 the file from all systems
>> and see if the md5 hashes are the same. If they are, then yes,
>> the finger would point more in the direction of ZFS. The
>> send/recv protocol hasn't changed in quite some time, but it
>> is arguably not as robust as it could be.
>
> Thanks! md5 hash is exactly the kind of test I was looking for.
> md5sum on SPARC  9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
> md5sum on X86    9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)

Good.

>> ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
>> for data (by default) and fletcher4 for metadata. The same fletcher
>> code is used. So if you believe fletcher4 is broken for send/recv,
>> how do you explain that it works for the metadata? Or does it?
>> There may be another failure mode at work here...
>> (see comment on scrubs at the end of this extended post)
>
> [Did you forget the scrubs comment?]

No, you responded that you had been seeing scrubs fix errors.

> Never said it was broken. I assume the same code is used for both SPARC
> and X86, and it works fine on SPARC. It would seem that this machine
> gets memory errors so often (even though it passes the Linux memory
> diagnostic) that it can never get to the end of a 4GB recv stream. Odd
> that it can do the md5sum, but as mentioned, perhaps doing the i/o
> puts more strain on the machine and stresses it to where more memory
> faults occur. I can't quite picture a software bug that would cause
> random failures on specific hardware and I am happy to give ZFS the
> benefit of the doubt.

Yes, software can trigger memory failures. More below...

>>>> It would have been nice if we were able to recover the contents of the
>>>> file; if you also know what was supposed to be there, you can diff and
>>>> then we can find out what was wrong.
>>>
>>> "file" on those files resulted in "bus error". Is there a way to actually
>>> read a file reported by ZFS as unrecoverable to do just that (and to
>>> separately retrieve the copy from each half of the mirror)?
>>
>> ZFS corrects automatically, when it can. But if the source data is
>> bad, then ZFS couldn't possibly detect it.
>>
>> For files that ZFS can detect are corrupted and cannot automatically
>> correct, you can get the list from "zpool status -xv". The behaviour
>> as seen by applications is determined by the zpool failmode property.
>
> Exactly. And "file" on such a file will repeatably segfault. So will
> pkg fix (there is a bug reported for this). Fortunately rm doesn't
> segfault or there would be no way to repair such files. Is there
> a way to actually get copies of files with bad checksums so they may be
> examined to see where the fault actually lies?

Yes, to some degree. See a few of the blogs in this collection: http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file

> Quoting the ZFS admin guide: "The failmode property ... provides the
> failmode property for determining the behavior of a catastrophic pool
> failure due to a loss of device connectivity or the failure of all
> devices in the pool." Has this changed since the ZFS admin guide
> was last updated? If not, it doesn't seem relevant.

It is relevant in those cases where you want a process to continue though the hardware has failed. Rather than panic, you can get an EIO.

>> Uhmm, if it were a software bug, one would expect it to fail
>> at exactly the same place, no?
>
> Exactly. Not a bug. If it were, it would have been fixed a long time
> ago on such a critical path. How about an RFE along the lines of
> "Improved support for machines without ecc memory"? How about one
> to recover files with bad checksums (a bit like getting fragments
> out of lost+found in the bad old days)?

argv! Why does this keep coming up? UFS fsck does not recover data! It only recovers metadata, sometimes.

>> Yep, interesting question.
>> But since you say "even zpool status shows no error at all after a
>> couple of scrubs" makes me think that you've had errors in the past?
>
> You bet! 5 unrecoverable errors, and maybe 10 or so recoverable
> ones. About once a month, zpool status shows an error (note this
> machine is being used as an X-terminal, so it hardly does any i/o)
> and a scrub gets rid of it.

heh, if the fault is in memory, then the scrub will be correcting correct data :-)

>> Please check the image views with md5 digests and get back to us.
>> If you get a chance, run SunVTS to verify the memory and CPU,
>> too. If the CPU is b0rken, the fletcher4 checksum for the recv may
>> be tickling it.
>
> If the CPU was broken, wouldn't it always fail at the same point in
> the stream?

Not necessarily. All failure modes are mechanical. There is a class of failure modes in semiconductors which are due to changes in the speed of transistors as a function of temperature. Temperature increases as a function of the frequency of input changes in a CMOS gate. So, if your software causes a specific change in the temperature of a portion of a device, then it could trip on a temperature-induced fault. These tend to be rare because of the margins, but if the hardware is flaky, it is already arguably beyond the margins. These sorts of codes might be humorously classified as halt-and-catch-fire. But they do exist, and there are some cool thermographs which show how the heat is distributed for various workloads. http://en.wikipedia.org/wiki/Halt_and_Catch_Fire

> It definitely doesn't. Could you expand a little on what
> it means to do md5sums on the image views? I'm not sure what an image
> view is in this context. AFAIK SUNWvts is available only in SXCE, not
> in Open Solaris. Oddly, you can load SUNWvts via pkg, but evidently
> not smcwebserver - please correct me if I am wrong. FWIW we are running
> SXCE on SPARC (installed via jumpstart) and indiana on X86 (installed
> via live CD and updated to snv111a via pkg).
>
>> <sidebar>
>> Microsoft got so tired of defending its software against memory
>> errors, that it requires Windows Server platforms to use ECC. But
>> even Microsoft doesn't have the power to force the vendors to use
>> ECC for all PCs.
>> </sidebar>
>
> Quite. My point exactly! My only issue is that I have experienced
> what is IMO an unreasonably large number of unrecoverable errors on
> mirrored drives. I was merely speculating on reasons for this and
> possible solutions. Ironically, my applications are running beautifully,
> and the users are quite happy with the performance and stability. ZFS
> is wonderful because updates are so easy to roll back and painless
> to install, snapshots are so useful, and all the other reasons that
> make every other fs seem so antiquated...

There may be an opportunity here. Let's assume that your disks were fine and the bad checksums were caused by transient memory faults. In such cases, a re-read of the data would effectively clear the transient fault. In a sense, this is where mirroring works against us -- ZFS will attempt to repair. This brings up a lot of much more complex system issues, which makes me glad that FMA exists ;-)
 -- richard
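Since FMA came up: the checksum and I/O error reports that ZFS feeds into FMA can be inspected directly, which is one way to see whether the errors correlate with a particular device or arrive with no disk activity at all. A sketch only - the exact output fields vary by build, and the grep pattern is just an example:

  # Dump the raw error reports FMA has collected; ZFS checksum events
  # normally carry a class of ereport.fs.zfs.checksum.
  fmdump -eV | grep -i 'class = ereport.fs.zfs'

  # Summarise any diagnosed faults and the suspect devices.
  fmadm faulty

  # Per-vdev read/write/checksum counters since the last clear.
  zpool status -v rpool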
Frank Middleton <f.middleton at apogeect.com> writes:

> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

we can't do much to protect ourselves from damage to the data itself (an extra copy in RAM will help little and ruin performance). damage to the bits holding the computed checksum before it is written can be alleviated by doing the calculation independently for each written copy. in particular, this will help if the bit error is transient.

since the number of octets in RAM holding the checksum is dwarfed by the number of octets occupied by data (256 bits vs. one mebibit for a full default-sized record), such a paranoia mode will most likely tell you that the *data* is corrupt, not the checksum. but today you don't know, so it's an improvement in my book.

> Quoting the ZFS admin guide: "The failmode property ... provides the
> failmode property for determining the behavior of a catastrophic
> pool failure due to a loss of device connectivity or the failure of
> all devices in the pool." Has this changed since the ZFS admin
> guide was last updated? If not, it doesn't seem relevant.

I guess checksum error handling is orthogonal to this and should have its own property. it sure would be nice if the admin could ask the OS to deliver the bits contained in a file, no matter what, and just log the problem.

> Cheers -- Frank

thank you for pointing out this potential weakness in ZFS' consistency checking, I didn't realise it was there. also thank you, all ZFS developers, for your great job :-)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
On 05/26/09 13:07, Kjetil Torgrim Homme wrote:

> also thank you, all ZFS developers, for your great job :-)

I'll second that! A great achievement - puts Solaris in a league of its own, so much so that you'd want to run it on all your hardware, however crappy the hardware might be ;-)

There are too many branches in this thread now. Going to summarize here without responding to some of the less than helpful comments, although death and taxes seems an ironic metaphor in the current climate :-)

In some ways this isn't a technical issue. This much maligned machine and its ilk are running Solaris and ZFS quite happily, and the users are pleased with the stability and performance. But their applications are running on machines (via xdmcp) with ECC, and ZFS mirror/raidz doesn't have a problem there.

Picture a new convert with enthusiasm for ZFS, but who has a less than perfect PC which has otherwise been apparently quite reliable. Perhaps it already has mirrored drives. He/she installs Solaris from the live CD (and finds that the installer doesn't support mirroring). The install fails, or worse, afterwards he/she loses that movie of Aunt Minnie playing golf, because a checksum error makes the file unrecoverable. This could be very frustrating and make the blogosphere go crazy, especially if the PC passes every diagnostic. It would be even worse if a file were lost on a mirror.

Unrecoverable files on mirrored drives simply shouldn't happen. What kind of hardware error (other than a rare bit flip) could conceivably cause 5 out of 15 checksum errors to be unrecoverable when mirrored during the write of around 20*10^10 bits? ZFS has both a larger spatial and temporal footprint than other file systems, so it is slightly more vulnerable to the once-a-month on average bit flip that will afflict many a PC with 4GB of memory.

Perhaps someone with a statistical bent could step in and actually calculate the probability of random errors, perhaps assuming that half of available memory is used to queue writes, that there is a 95% chance of one bit flip per month per 4GB, and that there is a (say) 10% duty cycle over a period of a year. Alternatively, the chance of a 1-bit flip over a period of 6 hours at a 100% duty cycle, repeated 1461 times (1461 installs per year at 100%). Seems to me intuitively that 6 out of 1461 installs will fail due to an unrecoverable checksum failure, but I'm not a statistician. Multiply that failure rate by the number of Live CD installs you expect over the next year (noting that *all* checksum failures are unrecoverable without mirroring) and you'll count quite a few frustrated would-be installers.

Maybe ZFS without ECC and no mirroring should disable checksumming by default - it would be a little worse than UFS and ext3 (due to its larger spatial and temporal footprints) but still provide all the other great features.

Proposed RFE #1

Add an option to make files with unrecoverable checksum failures readable and to pass the best image possible back to the application. [How much do you bet most folks would select this option?] If both sides of the mirror could be read, it might help to diagnose the problem, which obviously must be in the hardware somewhere. If both images are identical, then it surely must be memory. If they differ, then what could it be?

Proposed RFE #2

Add an option for machines with mirrored drives but without ECC to double buffer and only then calculate the checksums (for those who are reasonably paranoid about cosmic rays).

Proposed RFE #3 (or is this a bug report?)
Add diagnostics to the ZFS recv to help understand why a perfectly good ZFS send can't be received when the same machine can successfully compute an md5sum over the same stream. Even something like "recv failed at block nnnnnnn" would help. For example, it seems to fail suspiciously close to 2GB on a 32-bit machine.

Proposed RFE #4

Disable checksumming by default if no mirroring and no ECC is detected. (Of course this assumes an install-to-mirror option.) If it could still checksum, but make a failure a warning instead of an error, this could turn into a great feature for cheapskates with machines that have no ECC.

---

#1 and #2 above could be fixed in the documentation: "Random memory bit flips can theoretically cause unrecoverable checksum failures, even if the data is mirrored. Either disable the checksum feature or only run ZFS on systems with ECC memory if you have any data you don't want to risk losing [even with a 1 bit error]".

None of this is meant as a criticism of ZFS, just suggestions to help make a merely superb file system into the unbeatable one it should be. (I suppose it really is a system of file systems, but ZFS it is...)

Regards -- Frank
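On the statistics Frank asks for, a rough back-of-envelope can be done with bc under his own stated assumptions (a 95% chance of at least one flip per month in 4GB, a 6-hour install window, flips treated as Poisson-distributed). Note this only bounds the chance of a flip *somewhere* in RAM during the window; whether a given flip lands in data that is about to be checksummed and written is a further, much smaller factor:

  # P(>=1 flip per ~730-hour month) = 0.95  =>  monthly rate r = -ln(0.05)
  # P(>=1 flip during a 6-hour install)     =  1 - exp(-r * 6/730)
  echo 'scale=6; r = -l(0.05); 1 - e(-r * 6/730)' | bc -l
  # ~ .0243  (roughly 2 or 3 installs in 100 see a flip somewhere in RAM)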