Can anyone say what the status of CR 6880994 (kernel/zfs Checksum failures on mirrored drives) might be? Setting copies=2 has mitigated the problem, which manifests itself consistently at boot by flagging libdlpi.so.1, but two recent power cycles in a row with no normal shutdown have resulted in a "permanent error" even with copies=2 on all of the root pool (and specifically having duplicated /lib to make sure there really are 2 copies).

How can it even be remotely possible to get a checksum failure on mirrored drives with copies=2? That would mean all four copies were corrupted. Admittedly this is on a grotty PC with no ECC and flaky bus parity, but how come the same file always gets flagged as being clobbered (even though apparently it isn't)?

The oddest part is that libdlpi.so.1 doesn't actually seem to be corrupted. nm lists it with no problem, and you can copy it to /tmp, rename it, and then copy it back. objdump and readelf can both process this library with no problem. But "pkg fix" flags an error in its own inscrutable way. CCing pkg-discuss in case a pkg guru can shed any light on what the output of "pkg fix" (below) means. Presumably libc is OK, or it wouldn't boot :-). This is with b125 on X86.

# zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     2
            c3d1s0    ONLINE       0     0     2
            c3d0s0    ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        //lib/libdlpi.so.1

# pkg fix SUNWcsl
Verifying: pkg://opensolarisdev/SUNWcsl                            ERROR
        file: lib/libc.so.1
                Elfhash: cbb55a2ea24db9e03d9cd08c25b20406896c2fef should be 0e73a56d6ea0753f3721988ccbd716e370e57c4e
Created ZFS snapshot: 2010-03-13-23:39:17
Repairing: pkg://opensolarisdev/SUNWcsl
pkg: Requested "fix" operation would affect files that cannot be modified in live image.
Please retry this operation on an alternate boot environment

# nm /lib/libdlpi.so.1
00015562 b Bbss.bss
00015562 b Bbss.bss
00015240 d Ddata.data
00015240 d Ddata.data
000152f8 d Dpicdata.picdata
00003ca8 r Drodata.rodata
00003ca0 r Drodata.rodata
00000000 A SUNW_1.1
00000000 A SUNWprivate
000150ac D _DYNAMIC
00015562 b _END_
00015000 D _GLOBAL_OFFSET_TABLE_
000016c0 T _PROCEDURE_LINKAGE_TABLE_
00000000 r _START_
         U ___errno
         U __ctype
         U __div64
00015562 D _edata
00015562 B _end
000043d7 R _etext
00003c84 t _fini
         U _fxstat
00003c68 t _init
00003ca0 r _lib_version
         U _lxstat
         U _xmknod
         U _xstat
         U abs
         U calloc
         U close
         U closedir
         U dgettext
         U dladm_close
         U dladm_dev2linkid
         U dladm_open
         U dladm_parselink
         U dladm_phys_info
         U dladm_walk
00002d5c T dlpi_arptype
0000222c T dlpi_bind
00001d6c T dlpi_close
000024d0 T dlpi_disabmulti
00002c78 T dlpi_disabnotify
000024b0 T dlpi_enabmulti
00002af4 T dlpi_enabnotify
00015288 d dlpi_errlist
00002ce4 T dlpi_fd
000025b8 T dlpi_get_physaddr
00002e50 T dlpi_iftype
00001dc8 T dlpi_info
00002d2c T dlpi_linkname
000039a8 T dlpi_mactype
000152f8 d dlpi_mactypes
000021a0 T dlpi_makelink
00001b00 T dlpi_open
00002158 T dlpi_parselink
00003ca8 r dlpi_primsizes
00002598 T dlpi_promiscoff
00002578 T dlpi_promiscon
000028fc T dlpi_recv
000027a4 T dlpi_send
000026d4 T dlpi_set_physaddr
00002d04 T dlpi_set_timeout
00003908 T dlpi_strerror
00002d48 T dlpi_style
00002384 T dlpi_unbind
00001a20 T dlpi_walk
         U free
00001998 t fstat
         U getenv
         U gethrtime
         U getmsg
000032fc t i_dlpi_attach
00003a28 t i_dlpi_buildsap
000032ac t i_dlpi_checkstyle
00003bfc t i_dlpi_deletenotifyid
000039e8 t i_dlpi_getprimsize
00003868 t i_dlpi_msg_common
000023f4 t i_dlpi_multi
00003bd4 t i_dlpi_notifyidexists
00003ac8 t i_dlpi_notifyind_process
00002f28 t i_dlpi_open
00003384 t i_dlpi_passive
000024e8 t i_dlpi_promisc
00003460 t i_dlpi_strgetmsg
000033e4 t i_dlpi_strputmsg
0000316c t i_dlpi_style1_open
000031f0 t i_dlpi_style2_open
000019f0 t i_dlpi_walk_link
00003a9c t i_dlpi_writesap
         U ifparse_ifspec
         U ioctl
00015240 d libdlpi_errlist
0000196c t lstat
         U memcpy
         U memset
000019c4 t mknod
         U open
         U opendir
         U poll
         U putmsg
         U readdir
         U snprintf
00001940 t stat
         U strchr
         U strerror
         U strlcpy
         U strlen
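One thing worth keeping in mind with the copies=2 mitigation described above: the property only applies to blocks written after it is set, which is presumably why /lib had to be duplicated by hand. A rough sketch of that sequence (the dataset name is illustrative, not taken from this system):

    # existing blocks keep the number of copies they were written with
    zfs set copies=2 rpool/ROOT/opensolaris
    zfs get copies rpool/ROOT/opensolaris
    # rewrite a file so it picks up the new setting (the rename is atomic)
    cp -p /lib/libdlpi.so.1 /lib/libdlpi.so.1.new
    mv /lib/libdlpi.so.1.new /lib/libdlpi.so.1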
Frank Middleton wrote:
> But "pkg fix" flags an error in its own inscrutable way. CCing
> pkg-discuss in case a pkg guru can shed any light on what the output of
> "pkg fix" (below) means. Presumably libc is OK, or it wouldn't boot :-).

The "problem" with libc here is that while /lib/libc.so.1 is delivered as one set of contents, /usr/lib/libc/libc_hwcap*.so.1 is mounted on top of it. Until bug 7926 is fixed, "pkg verify" doesn't look underneath the mountpoint and so thinks that libc has the wrong bits.

Danek
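A quick way to see the overlay Danek describes on a running system (the exact hwcap variant that is mounted will vary with the processor's capabilities):

    # the optimized libc is loopback-mounted over the file the package delivered
    mount | grep libc.so.1
    ls /usr/lib/libc/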
On Sun, March 14, 2010 13:54, Frank Middleton wrote:
> How can it even be remotely possible to get a checksum failure on mirrored
> drives with copies=2? That would mean all four copies were corrupted.
> Admittedly this is on a grotty PC with no ECC and flaky bus parity, but how
> come the same file always gets flagged as being clobbered (even though
> apparently it isn't)?
>
> The oddest part is that libdlpi.so.1 doesn't actually seem to be corrupted.
> nm lists it with no problem and you can copy it to /tmp, rename it, and then
> copy it back. objdump and readelf can both process this library with no
> problem. But "pkg fix" flags an error in its own inscrutable way. CCing
> pkg-discuss in case a pkg guru can shed any light on what the output of
> "pkg fix" (below) means. Presumably libc is OK, or it wouldn't boot :-).

This sounds really bizarre.

One detail suggestion on checking what's going on (since I don't have a clue
towards a real root-cause determination): get an md5sum of a clean copy of
the file, say from a new install or something, and check the
allegedly-corrupted copy against that. This can fairly easily give you a
pretty reliable indication of whether the file is truly corrupted or not.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On 03/15/10 01:01 PM, David Dyer-Bennet wrote:
> This sounds really bizarre.

Yes, it is. But CR 6880994 is bizarre too.

> One detail suggestion on checking what's going on (since I don't have a
> clue towards a real root-cause determination): get an md5sum of a clean
> copy of the file, say from a new install or something, and check the
> allegedly-corrupted copy against that. This can fairly easily give you a
> pretty reliable indication of whether the file is truly corrupted or not.

With many thanks to Danek Duvall, I got a new copy of libdlpi.so.1:

# md5sum /lib/libdlpi.so.1
2468392ff87b5810571572eb572d0a41  /lib/libdlpi.so.1
# md5sum /lib/libdlpi.so.1.orig
2468392ff87b5810571572eb572d0a41  /lib/libdlpi.so.1.orig
# zpool status -v
....
errors: Permanent errors have been detected in the following files:

        //lib/libdlpi.so.1.orig

So here we seem to have an example of a ZFS false positive, the first I've
seen or heard of. The good news is that it is still possible to read the
file, so this augurs well for the ability to boot under this circumstance.

FWIW, fmdump does seem to show actual checksum errors on all four copies in
16 attempts to read them. There were 3 groups of different bad checksums;
within each group the checksum was the same, but differed from the expected
value. Perhaps someone who can update CR 6880994 could add this, in the
hope that it might help lead to a better understanding.

For the casual reader, CR 6880994 is about a pathological PC that gets
checksum errors on the same set of files at boot, even though the root pool
is mirrored. With copies=2, ZFS can usually repair them. But after a recent
power cycle, all 4 copies reported bad checksums, yet in reality the file
seems to be uncorrupted. The machine has no ECC and flaky bus parity, so
there are plenty of ways for the data to get messed up. It's a mystery why
this only happens at boot, though.
On Mar 21, 2010, at 11:03 AM, Frank Middleton wrote:
> On 03/15/10 01:01 PM, David Dyer-Bennet wrote:
>
>> This sounds really bizarre.
>
> Yes, it is. But CR 6880994 is bizarre too.

Rolling back to a conversation with Frank last fall, here is the output of
fmdump which shows the single bit flip. Extra lines elided.

TIME                           CLASS
Oct 23 2009 14:53:01.525657508 ereport.fs.zfs.checksum
        class = ereport.fs.zfs.checksum
        pool = rpool
        vdev_guid = 0x509094f6dc795c97
        vdev_type = disk
        vdev_path = /dev/dsk/c3d0s0
        vdev_devid = id1,cmdk@AMaxtor_6Y080L0=Y32HE6XE/a
        parent_guid = 0x323cf9d672c3b05a
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_objset = 0x29
        zio_object = 0x1a209
        zio_level = 0
        zio_blkid = 0x0
        cksum_expected = 0x4a027c11b3ba4cec 0xbf274565d5615b7b 0x3ef5fe61b2ed672e 0xec8692f7fd33094a
        cksum_actual = 0x4a027c11b3ba4cec 0xbf274567d5615b7b 0x3ef5fe61b2ed672e 0xec86a5b3fd33094a
        cksum_algorithm = fletcher2
        bad_ranges = 0x228 0x230
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x1
        bad_range_clears = 0x0
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
        bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0

Here we see that one bit was set: 0x2 when we expected 0x0. Later that same
day...

Oct 23 2009 14:53:01.525657152 ereport.fs.zfs.checksum
        class = ereport.fs.zfs.checksum
        pool = rpool
        pool_guid = 0x5062a7a7247652b1
        vdev_guid = 0x1181c8516c0dc9b0
        vdev_type = disk
        vdev_path = /dev/dsk/c3d1s0
        vdev_devid = id1,cmdk@AWDC_WD800BB-00BSA0=WD-WMA6S1025599/a
        parent_guid = 0x323cf9d672c3b05a
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_objset = 0x29
        zio_object = 0x1a209
        zio_level = 0
        zio_blkid = 0x0
        cksum_expected = 0x4a027c11b3ba4cec 0xbf274565d5615b7b 0x3ef5fe61b2ed672e 0xec8692f7fd33094a
        cksum_actual = 0x4a027c11b3ba4cec 0xbf274567d5615b7b 0x3ef5fe61b2ed672e 0xec86a5b3fd33094a
        cksum_algorithm = fletcher2
        bad_ranges = 0x228 0x230
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x1
        bad_range_clears = 0x0
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
        bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0

So we see the exact same bit flipped (0x2 where 0x0 was expected) on two
different disks, /dev/dsk/c3d0s0 (Maxtor) and /dev/dsk/c3d1s0 (Western
Digital), at the same zio offset and size. I feel confident we are not
seeing a b0rken drive here. But something is clearly amiss, and we cannot
rule out the processor, memory, or controller.

Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so
I'll go out on a limb and speculate that there is something in the bit
pattern for that file that intermittently triggers a bit flip on this
system. I'll also speculate that this error will not be reproducible on
another system.

This sort of specific error analysis is possible after b125. See
CR 6867188 for more details.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 03/21/10 03:24 PM, Richard Elling wrote:
> I feel confident we are not seeing a b0rken drive here. But something is
> clearly amiss, and we cannot rule out the processor, memory, or controller.

Absolutely no question of that, otherwise this list would be flooded :-).
However, the purpose of the post wasn't really to diagnose the hardware but
to ask about the behavior of ZFS under certain error conditions.

> Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so
> I'll go out on a limb and speculate that there is something in the bit
> pattern for that file that intermittently triggers a bit flip on this
> system. I'll also speculate that this error will not be reproducible on
> another system.

Hopefully not, but you never know :-). However, this instance is different.
The example you quote shows both expected and actual checksums to be the
same. This time the expected and actual checksums are different, and fmdump
isn't flagging any bad_ranges or set-bits (the behavior you observed is
still happening, but it is orthogonal to this instance, at different times
and not always on this file).

Since the file itself is OK, and the expected checksums are always the
same, neither the file nor the metadata appear to be corrupted, so it
appears that both are making it into memory without error.

It would seem therefore that it is the actual checksum calculation that is
failing. But, only at boot time, the calculated (bad) checksums differ (out
of 16, 10, 3, and 3 are the same [1]), so it's not consistent. At this
point it would seem to be cpu or memory, but why only at boot? IMO it's an
old and feeble power supply under strain pushing cpu or memory to a margin
not seen during "normal" operation, which could be why diagnostics never
see anything amiss (and shows the importance of a good power supply).

FWIW the machine passed everything vts could throw at it for a couple of
days. Anyone got any suggestions for more targeted diagnostics?

There were several questions embedded in the original post, and I'm not
sure any of them have really been answered:

o Why is the file flagged by ZFS as fatally corrupted still accessible?
  [is this new behavior from b111b vs b125?]

o What possible mechanism could there be for the /calculated/ checksums
  of /four/ copies of just one specific file to be bad and no others?

o Why did this only happen at boot, to just this one file, which also is
  peculiarly subject to the bit flips you observed, also mostly at boot
  (sometimes at scrub)? I like the feeble power supply answer, but why
  just this one file? Bizarre...

# zpool get failmode rpool
NAME   PROPERTY  VALUE     SOURCE
rpool  failmode  wait      default

This machine is extremely memory limited, so I suspect that libdlpi.so.1 is
not in a cache. Certainly, a brand new copy wouldn't be, and there's no
problem writing and (much later) reading the new copy (or the old one, for
that matter). It remains to be seen whether the brand new copy gets
clobbered at boot (the machine, for all its faults, remains busily up and
operational for months at a time). Maybe I should schedule a reboot out of
curiosity :-).

> This sort of specific error analysis is possible after b125. See
> CR 6867188 for more details.

Wasn't this in b125? IIRC we upgraded to b125 for this very reason. There
certainly seems to be an overwhelming amount of data in the various logs!

Cheers -- Frank

[1] This could be (3+1) * 4 where in one instance all 3+1 happen to be the
same.
Does ZFS really read all 4 copies 4 times (by fmdump timestamp, 8 within
1 uS, 40 mS later, another 8, again within 1 uS)? Not sure what the fmdump
timestamps mean, so it's hard to find any pattern.
On Mar 22, 2010, at 4:21 PM, Frank Middleton wrote:
> On 03/21/10 03:24 PM, Richard Elling wrote:
>
>> I feel confident we are not seeing a b0rken drive here. But something is
>> clearly amiss and we cannot rule out the processor, memory, or controller.
>
> Absolutely no question of that, otherwise this list would be flooded :-).
> However, the purpose of the post wasn't really to diagnose the hardware
> but to ask about the behavior of ZFS under certain error conditions.
>
>> Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so
>> I'll go out on a limb and speculate that there is something in the bit
>> pattern for that file that intermittently triggers a bit flip on this
>> system. I'll also speculate that this error will not be reproducible on
>> another system.
>
> Hopefully not, but you never know :-). However, this instance is different.
> The example you quote shows both expected and actual checksums to be
> the same.

Look again, the checksums are different.

> This time the expected and actual checksums are different and
> fmdump isn't flagging any bad_ranges or set-bits (the behavior you observed
> is still happening, but orthogonal to this instance at different times and
> not always on this file).

don't forget the -V flag :-)

> Since the file itself is OK, and the expected checksums are always the same,
> neither the file nor the metadata appear to be corrupted, so it appears
> that both are making it into memory without error.
>
> It would seem therefore that it is the actual checksum calculation that is
> failing. But, only at boot time, the calculated (bad) checksums differ (out
> of 16, 10, 3, and 3 are the same [1]) so it's not consistent. At this point
> it would seem to be cpu or memory, but why only at boot? IMO it's an
> old and feeble power supply under strain pushing cpu or memory to a
> margin not seen during "normal" operation, which could be why diagnostics
> never see anything amiss (and the importance of a good power supply).
>
> FWIW the machine passed everything vts could throw at it for a couple
> of days. Anyone got any suggestions for more targeted diagnostics?
>
> There were several questions embedded in the original post, and I'm not
> sure any of them have really been answered:
>
> o Why is the file flagged by ZFS as fatally corrupted still accessible?
>   [is this new behavior from b111b vs b125?]
>
> o What possible mechanism could there be for the /calculated/ checksums
>   of /four/ copies of just one specific file to be bad and no others?

Broken CPU, HBA, bus, or memory.

> o Why did this only happen at boot to just this one file which also is
>   peculiarly subject to the bit flips you observed, also mostly at boot
>   (sometimes at scrub)? I like the feeble power supply answer, but why
>   just this one file? Bizarre...

Broken CPU, HBA, bus, memory, or power supply.

> # zpool get failmode rpool
> NAME   PROPERTY  VALUE     SOURCE
> rpool  failmode  wait      default
>
> This machine is extremely memory limited, so I suspect that libdlpi.so.1 is
> not in a cache. Certainly, a brand new copy wouldn't be, and there's no
> problem writing and (much later) reading the new copy (or the old one,
> for that matter). It remains to be seen if the brand new copy gets
> clobbered at boot (the machine, for all its faults, remains busily up and
> operational for months at a time). Maybe I should schedule a reboot out of
> curiosity :-).
>
>> This sort of specific error analysis is possible after b125. See
>> CR 6867188 for more details.
>
> Wasn't this in b125?
> IIRC we upgraded to b125 for this very reason. There
> certainly seems to be an overwhelming amount of data in the various logs!
>
> Cheers -- Frank
>
> [1] This could be (3+1) * 4 where in one instance all 3+1 happen to be the
> same. Does ZFS really read all 4 copies 4 times (by fmdump timestamp, 8
> within 1uS, 40mS later, another 8, again within 1uS)? Not sure what the
> fmdump timestamps mean, so it's hard to find any pattern.

Transient failures are some of the most difficult to track down. Not all
transient failures are random.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 03/22/10 11:50 PM, Richard Elling wrote:
> Look again, the checksums are different.

Whoops, you are correct, as usual. Just 6 bits out of 256 different...

Last year:
  expected 4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a
  actual   4a027c11b3ba4cec bf274567d5615b7b 3ef5fe61b2ed672e ec86a5b3fd33094a

Last month (obviously a different file):
  expected 4b454eec8aebddb5 3b74c5235e1963ee c4489bdb2b475e76 fda3474dd1b6b63f
  actual   4b454eec8aebddb5 3b74c5255e1963ee c4489bdb2b475e76 fda354c1d1b6b63f

Look which bits are different - digits 24 and 53-56 in both cases. But
comparing the bits, there's no discernible pattern. Is this an artifact of
the algorithm, made by one erring bit always being at the same offset?

> don't forget the -V flag :-)

I didn't. As mentioned, there are subsequent set-bit errors (14 minutes
later), but none for this particular incident. I'll send you the results
separately, since they are so puzzling. These 16 checksum failures on
libdlpi.so.1 were the only fmdump -eV entries for the entire boot sequence,
except that it started out with one ereport.fs.zfs.data, whatever that is,
for a total of exactly 17 records: 9 in 1 uS, then 8 more 40 mS later, also
within 1 uS. Then nothing for 4 minutes, one more checksum failure
("bad_range_sets =") and then, 10 minutes later, two with the set-bits
error, one for each disk. That's it.

>> o Why is the file flagged by ZFS as fatally corrupted still accessible?

This is the part I was hoping to get answers for, since AFAIK this should
be impossible. Since none of this is having any operational impact, all of
these issues are of interest only, but this is a bit scary!

> Broken CPU, HBA, bus, memory, or power supply.

No argument there. Doesn't leave much, does it :-). Since the file itself
appears to be uncorrupted, and the metadata is consistent for all 16
entries, it would seem that the checksum calculation itself is failing,
because it would appear in this case that everything else is OK.

Is there a way to apply the fletcher2 algorithm interactively, as in sum(1)
or cksum(1) (i.e., outside the scope of ZFS), to see if it is in some way
pattern sensitive with this CPU? Since only a small subset of files is
affected, this should be easy to verify. Start a scrub to heat things up
and then in parallel do checksums in a tight loop...

> Transient failures are some of the most difficult to track down. Not all
> transient failures are random.

Indeed, although this doesn't seem to be random. The hits to libdlpi.so.1
seem to be quite reproducible, as you've seen from the fmdump log, although
I doubt this particular scenario will happen again. Can you think of any
tools to investigate this? I suppose I could extract the checksum code from
ZFS itself to build one, but that would take quite a lot of time.

Is there any documentation that explains the output of fmdump -eV? What are
set-bits, for example? I guess not... from man fmdump(1m):

     The error log file contains /Private/ telemetry information used by
     Sun's automated diagnosis software.
     ......
     Each problem recorded in the fault log is identified by:
     o The time of its diagnosis

So did ZFS really read 8 copies of libdlpi.so.1 within 1 uS, wait 40 mS and
then read another 8 copies in 1 uS again? I doubt it :-). I bet it took
more than 1 uS just to (mis)calculate the checksum (1.6GHz 16 bit cpu).

Thanks -- Frank
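A quick way to see exactly which bits differ, rather than comparing hex digits by eye, is to XOR the corresponding 64-bit words of the expected and actual checksums. A sketch in bash using the second word from each of the two incidents above (bash arithmetic is 64-bit, so the large constants wrap, but the XOR is still bit-exact):

    printf '%016x\n' $(( 0xbf274565d5615b7b ^ 0xbf274567d5615b7b ))   # last year
    printf '%016x\n' $(( 0x3b74c5235e1963ee ^ 0x3b74c5255e1963ee ))   # last month
    # prints 0000000200000000 and 0000000600000000: both differences fall in
    # the same byte of the same word (hex digit 24 of the 64-digit checksum)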
On Tue, Mar 23, 2010 at 07:22:59PM -0400, Frank Middleton wrote:
> On 03/22/10 11:50 PM, Richard Elling wrote:
>
>> Look again, the checksums are different.
>
> Whoops, you are correct, as usual. Just 6 bits out of 256 different...
>
> Look which bits are different - digits 24 and 53-56 in both cases.

This is very likely an error introduced during the calculation of the hash,
rather than an error in the input data. I don't know how that helps narrow
down the source of the problem, though.

It suggests an experiment: try switching to another hash algorithm. It may
move the problem around, or even make it worse, of course.

I'm also reminded of a thread about the implementation of fletcher2 being
flawed; perhaps you're better off switching regardless.

>>> o Why is the file flagged by ZFS as fatally corrupted still accessible?
>
> This is the part I was hoping to get answers for, since AFAIK this should
> be impossible. Since none of this is having any operational impact, all of
> these issues are of interest only, but this is a bit scary!

It's only the blocks with bad checksums that should return errors. Maybe
you're not reading those, or the transient error doesn't happen next time
when you actually try to read it / from the other side of the mirror.

Repeated errors in the same file could also be a symptom of an error
calculating the hash when the file was written. If there's a bit-flipping
issue at the root of it, with some given probability, that would invert the
probabilities of "correct" and "error" results.

-- 
Dan.
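For anyone who wants to try the experiment, the checksum algorithm is an ordinary per-dataset property; note that it only affects blocks written after the change, so existing data keeps its fletcher2 checksums until it is rewritten (the dataset name here is illustrative):

    # switch new writes to sha256 (fletcher4 is another option)
    zfs set checksum=sha256 rpool/ROOT/opensolaris
    zfs get checksum rpool/ROOT/opensolaris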
You could try copying the file to /tmp (i.e. swap/ram) and doing a
continuous loop of checksums, e.g.:

    while [ ! -f libdlpi.so.1.x ]
    do
        sleep 1
        cp libdlpi.so.1 libdlpi.so.1.x
        A="`sha512sum -b libdlpi.so.1.x`"
        [ "$A" == "<hash it should be> *libdlpi.so.1.x" ] && rm libdlpi.so.1.x
    done ; date

The loop keeps going as long as each copy checksums correctly, and stops
(leaving the bad copy behind) as soon as one does not.

Assuming the file never goes to swap, this would tell you if something on
the motherboard is playing up.

I have seen a CPU randomly set a byte to 0 which should not be 0; I think
it was an L1 or L2 cache problem.
-- 
This message posted from opensolaris.org
How about running memtest86+ (http://www.memtest.org/) on the machine for a
while? It doesn't test the arithmetic on the CPU very much, but it stresses
the data paths quite a lot. Just a quick suggestion...
-- 
Saso

Damon Atkins wrote:
> You could try copying the file to /tmp (i.e. swap/ram) and doing a
> continuous loop of checksums, e.g.:
>
> while [ ! -f libdlpi.so.1.x ] ; do sleep 1; cp libdlpi.so.1 libdlpi.so.1.x ;
> A="`sha512sum -b libdlpi.so.1.x`" ;
> [ "$A" == "<hash it should be> *libdlpi.so.1.x" ] && rm libdlpi.so.1.x ;
> done ; date
>
> Assuming the file never goes to swap, this would tell you if something on
> the motherboard is playing up.
>
> I have seen a CPU randomly set a byte to 0 which should not be 0; I think
> it was an L1 or L2 cache problem.
You could also use psradm to take a CPU off-line. At boot I would *assume*
the system boots the same way every time unless something changes, so you
could be hitting the same CPU core every time, or the same bit of RAM,
until the system is fully booted.

Or even run SunVTS (the Validation Test Suite), which I believe has a test
similar to the cp-to-/tmp loop above, plus all the other tests it has.
-- 
This message posted from opensolaris.org
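If the box actually has more than one processor to spare, the off-lining experiment looks roughly like this (going from memory of the psradm(1M) flags, so check the man page before relying on it):

    psrinfo          # list processor IDs and their current state
    psradm -f 1      # take processor 1 off-line
    psradm -n 1      # bring it back on-line afterwards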
On Mar 23, 2010, at 11:21 PM, Daniel Carosone wrote:
> On Tue, Mar 23, 2010 at 07:22:59PM -0400, Frank Middleton wrote:
>> On 03/22/10 11:50 PM, Richard Elling wrote:
>>
>>> Look again, the checksums are different.
>>
>> Whoops, you are correct, as usual. Just 6 bits out of 256 different...
>>
>> Look which bits are different - digits 24 and 53-56 in both cases.
>
> This is very likely an error introduced during the calculation of
> the hash, rather than an error in the input data. I don't know how
> that helps narrow down the source of the problem, though.

The exact same code is used to calculate the checksum when writing or
reading. However, we assume the processor works, and Frank's tests do not
indicate otherwise.

> It suggests an experiment: try switching to another hash algorithm.
> It may move the problem around, or even make it worse, of course.
>
> I'm also reminded of a thread about the implementation of fletcher2
> being flawed; perhaps you're better off switching regardless.

Clearly, fletcher2 identified the problem.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Thanks to everyone who made suggestions! This machine has run memtest for a
week and VTS for several days with no errors. It does seem that the problem
is probably in the CPU cache.

On 03/24/10 10:07 AM, Damon Atkins wrote:
> You could try copying the file to /tmp (i.e. swap/ram) and doing a
> continuous loop of checksums

On a variation of your suggestion, I implemented a bash script that applies
sha1sum 10,000 times, with a pause of 0.1 s between each attempt, and tests
the result against what seemed to be the correct result.

sha1sum on /lib/libdlpi.so.1 produced incorrect results 11% of the time.
sha1sum on /tmp/libdlpi.so.1 produced 5 failures out of 10,000.
sha1sum on /lib/libpam.so.1 produced zero errors in 10,000.
sha1sum on /tmp/libpam.so.1, ditto.

So what we have is a pattern-sensitive failure that is also sensitive to
how busy the cpu is (and doesn't fail running VTS). md5sum and sha256sum
produced similar results, and presumably so would fletcher2. To get really
meaningful results, the machine should be otherwise idle (but then, maybe
it wouldn't fail).

Is anyone willing to speculate (or does anyone have suggestions for further
experiments) about what failure mode could cause a checksum calculation to
be pattern sensitive and also thousands of times more likely to fail if the
file is read from disk vs. tmpfs? FWIW the failures are pretty consistent,
mostly but not always producing the same bad checksum.

So at boot, the cpu is busy, increasing the probability of this
pattern-sensitive failure, and this one time it failed on every read of
/lib/libdlpi.so.1. With copies=1 this was twice as likely to happen, and
when it did, ZFS returned an error on any attempt to read the file. With
copies=2, in this case, it doesn't return an error when attempting to read.
Also, there were no set-bit errors this time, but then I have no idea what
a set-bit error is.

On 03/24/10 12:32 PM, Richard Elling wrote:
> Clearly, fletcher2 identified the problem.

Ironically, on this hardware it seems it created the problem :-). However,
you have been vindicated - it was a pattern-sensitive problem, as you have
long suggested it might be. So: that the file is still readable is a
mystery, but how it came to be flagged as bad in ZFS isn't, any more.

Cheers -- Frank
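The script itself isn't quoted in the post; a minimal sketch of the kind of loop described above might look like the following (the reference hash is a placeholder, and a sleep that accepts fractional seconds, e.g. GNU sleep, is assumed):

    #!/usr/bin/bash
    # Hash the same file 10,000 times and count mismatches against a
    # known-good reference value.
    f=/lib/libdlpi.so.1
    good="PUT-KNOWN-GOOD-SHA1-HERE"     # placeholder, not the real hash
    bad=0
    i=0
    while [ $i -lt 10000 ]; do
        a=`sha1sum "$f" | awk '{print $1}'`
        [ "$a" != "$good" ] && bad=$((bad + 1))
        sleep 0.1
        i=$((i + 1))
    done
    echo "$bad bad checksums out of $i runs on $f"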