Can anyone say what the status of CR 6880994 (kernel/zfs Checksum failures on mirrored drives) might be? Setting copies=2 has mitigated the problem, which manifests itself consistently at boot by flagging libdlpi.so.1, but two recent power cycles in a row with no normal shutdown have resulted in a "permanent error" even with copies=2 on all of the root pool (and specifically having duplicated /lib to make sure there really are 2 copies).

How can it even be remotely possible to get a checksum failure on mirrored drives with copies=2? That would mean all four copies were corrupted. Admittedly this is on a grotty PC with no ECC and flaky bus parity, but how come the same file always gets flagged as being clobbered (even though apparently it isn't)?

The oddest part is that libdlpi.so.1 doesn't actually seem to be corrupted. nm lists it with no problem, and you can copy it to /tmp, rename it, and then copy it back. objdump and readelf can both process this library with no problem. But "pkg fix" flags an error in its own inscrutable way. CCing pkg-discuss in case a pkg guru can shed any light on what the output of "pkg fix" (below) means. Presumably libc is OK, or it wouldn't boot :-). This is with b125 on X86.

# zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     2
            c3d1s0    ONLINE       0     0     2
            c3d0s0    ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        //lib/libdlpi.so.1

# pkg fix SUNWcsl
Verifying: pkg://opensolarisdev/SUNWcsl                            ERROR
        file: lib/libc.so.1
                Elfhash: cbb55a2ea24db9e03d9cd08c25b20406896c2fef should be 0e73a56d6ea0753f3721988ccbd716e370e57c4e
Created ZFS snapshot: 2010-03-13-23:39:17
Repairing: pkg://opensolarisdev/SUNWcsl
pkg: Requested "fix" operation would affect files that cannot be modified in live image.
Please retry this operation on an alternate boot environment

# nm /lib/libdlpi.so.1
00015562 b Bbss.bss
00015562 b Bbss.bss
00015240 d Ddata.data
00015240 d Ddata.data
000152f8 d Dpicdata.picdata
00003ca8 r Drodata.rodata
00003ca0 r Drodata.rodata
00000000 A SUNW_1.1
00000000 A SUNWprivate
000150ac D _DYNAMIC
00015562 b _END_
00015000 D _GLOBAL_OFFSET_TABLE_
000016c0 T _PROCEDURE_LINKAGE_TABLE_
00000000 r _START_
         U ___errno
         U __ctype
         U __div64
00015562 D _edata
00015562 B _end
000043d7 R _etext
00003c84 t _fini
         U _fxstat
00003c68 t _init
00003ca0 r _lib_version
         U _lxstat
         U _xmknod
         U _xstat
         U abs
         U calloc
         U close
         U closedir
         U dgettext
         U dladm_close
         U dladm_dev2linkid
         U dladm_open
         U dladm_parselink
         U dladm_phys_info
         U dladm_walk
00002d5c T dlpi_arptype
0000222c T dlpi_bind
00001d6c T dlpi_close
000024d0 T dlpi_disabmulti
00002c78 T dlpi_disabnotify
000024b0 T dlpi_enabmulti
00002af4 T dlpi_enabnotify
00015288 d dlpi_errlist
00002ce4 T dlpi_fd
000025b8 T dlpi_get_physaddr
00002e50 T dlpi_iftype
00001dc8 T dlpi_info
00002d2c T dlpi_linkname
000039a8 T dlpi_mactype
000152f8 d dlpi_mactypes
000021a0 T dlpi_makelink
00001b00 T dlpi_open
00002158 T dlpi_parselink
00003ca8 r dlpi_primsizes
00002598 T dlpi_promiscoff
00002578 T dlpi_promiscon
000028fc T dlpi_recv
000027a4 T dlpi_send
000026d4 T dlpi_set_physaddr
00002d04 T dlpi_set_timeout
00003908 T dlpi_strerror
00002d48 T dlpi_style
00002384 T dlpi_unbind
00001a20 T dlpi_walk
         U free
00001998 t fstat
         U getenv
         U gethrtime
         U getmsg
000032fc t i_dlpi_attach
00003a28 t i_dlpi_buildsap
000032ac t i_dlpi_checkstyle
00003bfc t i_dlpi_deletenotifyid
000039e8 t i_dlpi_getprimsize
00003868 t i_dlpi_msg_common
000023f4 t i_dlpi_multi
00003bd4 t i_dlpi_notifyidexists
00003ac8 t i_dlpi_notifyind_process
00002f28 t i_dlpi_open
00003384 t i_dlpi_passive
000024e8 t i_dlpi_promisc
00003460 t i_dlpi_strgetmsg
000033e4 t i_dlpi_strputmsg
0000316c t i_dlpi_style1_open
000031f0 t i_dlpi_style2_open
000019f0 t i_dlpi_walk_link
00003a9c t i_dlpi_writesap
         U ifparse_ifspec
         U ioctl
00015240 d libdlpi_errlist
0000196c t lstat
         U memcpy
         U memset
000019c4 t mknod
         U open
         U opendir
         U poll
         U putmsg
         U readdir
         U snprintf
00001940 t stat
         U strchr
         U strerror
         U strlcpy
         U strlen
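One thing worth keeping in mind with the copies=2 mitigation described above: the property only applies to blocks written after it is set, which is presumably why /lib had to be duplicated by hand. A rough sketch of that sequence (the dataset name is illustrative, not taken from this system):

    # existing blocks keep the number of copies they were written with
    zfs set copies=2 rpool/ROOT/opensolaris
    zfs get copies rpool/ROOT/opensolaris
    # rewrite a file so it picks up the new setting (the rename is atomic)
    cp -p /lib/libdlpi.so.1 /lib/libdlpi.so.1.new
    mv /lib/libdlpi.so.1.new /lib/libdlpi.so.1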
Frank Middleton wrote:
> But "pkg fix" flags an error in its own inscrutable way. CCing
> pkg-discuss in case a pkg guru can shed any light on what the output of
> "pkg fix" (below) means. Presumably libc is OK, or it wouldn't boot :-).

The "problem" with libc here is that while /lib/libc.so.1 is delivered as one set of contents, /usr/lib/libc/libc_hwcap*.so.1 is mounted on top of it. Until bug 7926 is fixed, "pkg verify" doesn't look underneath the mountpoint and so thinks that libc has the wrong bits.

Danek
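A quick way to see the overlay Danek describes on a running system (the exact hwcap variant that is mounted will vary with the processor's capabilities):

    # the optimized libc is loopback-mounted over the file the package delivered
    mount | grep libc.so.1
    ls /usr/lib/libc/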
On Sun, March 14, 2010 13:54, Frank Middleton wrote:
> How can it even be remotely possible to get a checksum failure on mirrored
> drives with copies=2? That would mean all four copies were corrupted.
> Admittedly this is on a grotty PC with no ECC and flaky bus parity, but how
> come the same file always gets flagged as being clobbered (even though
> apparently it isn't)?
>
> The oddest part is that libdlpi.so.1 doesn't actually seem to be corrupted.
> nm lists it with no problem and you can copy it to /tmp, rename it, and then
> copy it back. objdump and readelf can both process this library with no
> problem. But "pkg fix" flags an error in its own inscrutable way. CCing
> pkg-discuss in case a pkg guru can shed any light on what the output of
> "pkg fix" (below) means. Presumably libc is OK, or it wouldn't boot :-).

This sounds really bizarre.

One detail suggestion on checking what's going on (since I don't have a clue
towards a real root-cause determination): get an md5sum of a clean copy of
the file, say from a new install or something, and check the
allegedly-corrupted copy against that. This can fairly easily give you a
pretty reliable indication of whether the file is truly corrupted or not.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On 03/15/10 01:01 PM, David Dyer-Bennet wrote:
> This sounds really bizarre.

Yes, it is. But CR 6880994 is bizarre too.

> One detail suggestion on checking what's going on (since I don't have a
> clue towards a real root-cause determination): get an md5sum of a clean
> copy of the file, say from a new install or something, and check the
> allegedly-corrupted copy against that. This can fairly easily give you a
> pretty reliable indication of whether the file is truly corrupted or not.

With many thanks to Danek Duvall, I got a new copy of libdlpi.so.1:

# md5sum /lib/libdlpi.so.1
2468392ff87b5810571572eb572d0a41  /lib/libdlpi.so.1
# md5sum /lib/libdlpi.so.1.orig
2468392ff87b5810571572eb572d0a41  /lib/libdlpi.so.1.orig
# zpool status -v
....
errors: Permanent errors have been detected in the following files:

        //lib/libdlpi.so.1.orig

So here we seem to have an example of a ZFS false positive, the first I've
seen or heard of. The good news is that it is still possible to read the
file, so this augurs well for the ability to boot under this circumstance.

FWIW, fmdump does seem to show actual checksum errors on all four copies in
16 attempts to read them. There were 3 groups of different bad checksums;
within each group the checksum was the same, but differed from the expected
value. Perhaps someone who can update CR 6880994 could add this, in the
hope that it might help lead to a better understanding.

For the casual reader, CR 6880994 is about a pathological PC that gets
checksum errors on the same set of files at boot, even though the root pool
is mirrored. With copies=2, ZFS can usually repair them. But after a recent
power cycle, all 4 copies reported bad checksums, yet in reality the file
seems to be uncorrupted. The machine has no ECC and flaky bus parity, so
there are plenty of ways for the data to get messed up. It's a mystery why
this only happens at boot, though.
On Mar 21, 2010, at 11:03 AM, Frank Middleton wrote:
> On 03/15/10 01:01 PM, David Dyer-Bennet wrote:
>
>> This sounds really bizarre.
>
> Yes, it is. But CR 6880994 is bizarre too.

Rolling back to a conversation with Frank last fall, here is the output of
fmdump which shows the single bit flip. Extra lines elided.

TIME                           CLASS
Oct 23 2009 14:53:01.525657508 ereport.fs.zfs.checksum
        class = ereport.fs.zfs.checksum
        pool = rpool
        vdev_guid = 0x509094f6dc795c97
        vdev_type = disk
        vdev_path = /dev/dsk/c3d0s0
        vdev_devid = id1,cmdk@AMaxtor_6Y080L0=Y32HE6XE/a
        parent_guid = 0x323cf9d672c3b05a
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_objset = 0x29
        zio_object = 0x1a209
        zio_level = 0
        zio_blkid = 0x0
        cksum_expected = 0x4a027c11b3ba4cec 0xbf274565d5615b7b 0x3ef5fe61b2ed672e 0xec8692f7fd33094a
        cksum_actual = 0x4a027c11b3ba4cec 0xbf274567d5615b7b 0x3ef5fe61b2ed672e 0xec86a5b3fd33094a
        cksum_algorithm = fletcher2
        bad_ranges = 0x228 0x230
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x1
        bad_range_clears = 0x0
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
        bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0

Here we see that one bit was set: 0x2 when we expected 0x0. Later that same
day...

Oct 23 2009 14:53:01.525657152 ereport.fs.zfs.checksum
        class = ereport.fs.zfs.checksum
        pool = rpool
        pool_guid = 0x5062a7a7247652b1
        vdev_guid = 0x1181c8516c0dc9b0
        vdev_type = disk
        vdev_path = /dev/dsk/c3d1s0
        vdev_devid = id1,cmdk@AWDC_WD800BB-00BSA0=WD-WMA6S1025599/a
        parent_guid = 0x323cf9d672c3b05a
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x50384800
        zio_size = 0x9800
        zio_objset = 0x29
        zio_object = 0x1a209
        zio_level = 0
        zio_blkid = 0x0
        cksum_expected = 0x4a027c11b3ba4cec 0xbf274565d5615b7b 0x3ef5fe61b2ed672e 0xec8692f7fd33094a
        cksum_actual = 0x4a027c11b3ba4cec 0xbf274567d5615b7b 0x3ef5fe61b2ed672e 0xec86a5b3fd33094a
        cksum_algorithm = fletcher2
        bad_ranges = 0x228 0x230
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x1
        bad_range_clears = 0x0
        bad_set_bits = 0x0 0x0 0x0 0x0 0x2 0x0 0x0 0x0
        bad_cleared_bits = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0

So we see the exact same bit flipped (0x2 where 0x0 was expected) on two
different disks, /dev/dsk/c3d0s0 (Maxtor) and /dev/dsk/c3d1s0 (Western
Digital), at the same zio offset and size. I feel confident we are not
seeing a b0rken drive here. But something is clearly amiss, and we cannot
rule out the processor, memory, or controller.

Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so
I'll go out on a limb and speculate that there is something in the bit
pattern for that file that intermittently triggers a bit flip on this
system. I'll also speculate that this error will not be reproducible on
another system.

This sort of specific error analysis is possible after b125. See
CR 6867188 for more details.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 03/21/10 03:24 PM, Richard Elling wrote:
> I feel confident we are not seeing a b0rken drive here. But something is
> clearly amiss, and we cannot rule out the processor, memory, or controller.

Absolutely no question of that, otherwise this list would be flooded :-).
However, the purpose of the post wasn't really to diagnose the hardware but
to ask about the behavior of ZFS under certain error conditions.

> Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so
> I'll go out on a limb and speculate that there is something in the bit
> pattern for that file that intermittently triggers a bit flip on this
> system. I'll also speculate that this error will not be reproducible on
> another system.

Hopefully not, but you never know :-). However, this instance is different.
The example you quote shows both expected and actual checksums to be the
same. This time the expected and actual checksums are different, and fmdump
isn't flagging any bad_ranges or set-bits (the behavior you observed is
still happening, but it is orthogonal to this instance, at different times
and not always on this file).

Since the file itself is OK, and the expected checksums are always the
same, neither the file nor the metadata appear to be corrupted, so it
appears that both are making it into memory without error.

It would seem therefore that it is the actual checksum calculation that is
failing. But, only at boot time, the calculated (bad) checksums differ (out
of 16, 10, 3, and 3 are the same [1]), so it's not consistent. At this
point it would seem to be cpu or memory, but why only at boot? IMO it's an
old and feeble power supply under strain pushing cpu or memory to a margin
not seen during "normal" operation, which could be why diagnostics never
see anything amiss (and shows the importance of a good power supply).

FWIW the machine passed everything vts could throw at it for a couple of
days. Anyone got any suggestions for more targeted diagnostics?

There were several questions embedded in the original post, and I'm not
sure any of them have really been answered:

o Why is the file flagged by ZFS as fatally corrupted still accessible?
  [is this new behavior from b111b vs b125?]

o What possible mechanism could there be for the /calculated/ checksums
  of /four/ copies of just one specific file to be bad and no others?

o Why did this only happen at boot, to just this one file, which also is
  peculiarly subject to the bit flips you observed, also mostly at boot
  (sometimes at scrub)? I like the feeble power supply answer, but why
  just this one file? Bizarre...

# zpool get failmode rpool
NAME   PROPERTY  VALUE     SOURCE
rpool  failmode  wait      default

This machine is extremely memory limited, so I suspect that libdlpi.so.1 is
not in a cache. Certainly, a brand new copy wouldn't be, and there's no
problem writing and (much later) reading the new copy (or the old one, for
that matter). It remains to be seen whether the brand new copy gets
clobbered at boot (the machine, for all its faults, remains busily up and
operational for months at a time). Maybe I should schedule a reboot out of
curiosity :-).

> This sort of specific error analysis is possible after b125. See
> CR 6867188 for more details.

Wasn't this in b125? IIRC we upgraded to b125 for this very reason. There
certainly seems to be an overwhelming amount of data in the various logs!

Cheers -- Frank

[1] This could be (3+1) * 4 where in one instance all 3+1 happen to be the
same.
Does ZFS really read all 4 copies 4 times (by fmdump timestamp, 8 within
1 uS, 40 mS later, another 8, again within 1 uS)? Not sure what the fmdump
timestamps mean, so it's hard to find any pattern.
On Mar 22, 2010, at 4:21 PM, Frank Middleton wrote:
> On 03/21/10 03:24 PM, Richard Elling wrote:
>
>> I feel confident we are not seeing a b0rken drive here. But something is
>> clearly amiss and we cannot rule out the processor, memory, or controller.
>
> Absolutely no question of that, otherwise this list would be flooded :-).
> However, the purpose of the post wasn't really to diagnose the hardware
> but to ask about the behavior of ZFS under certain error conditions.
>
>> Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so
>> I'll go out on a limb and speculate that there is something in the bit
>> pattern for that file that intermittently triggers a bit flip on this
>> system. I'll also speculate that this error will not be reproducible on
>> another system.
>
> Hopefully not, but you never know :-). However, this instance is different.
> The example you quote shows both expected and actual checksums to be
> the same.

Look again, the checksums are different.

> This time the expected and actual checksums are different and
> fmdump isn't flagging any bad_ranges or set-bits (the behavior you observed
> is still happening, but orthogonal to this instance at different times and
> not always on this file).

don't forget the -V flag :-)

> Since the file itself is OK, and the expected checksums are always the same,
> neither the file nor the metadata appear to be corrupted, so it appears
> that both are making it into memory without error.
>
> It would seem therefore that it is the actual checksum calculation that is
> failing. But, only at boot time, the calculated (bad) checksums differ (out
> of 16, 10, 3, and 3 are the same [1]) so it's not consistent. At this point
> it would seem to be cpu or memory, but why only at boot? IMO it's an
> old and feeble power supply under strain pushing cpu or memory to a
> margin not seen during "normal" operation, which could be why diagnostics
> never see anything amiss (and the importance of a good power supply).
>
> FWIW the machine passed everything vts could throw at it for a couple
> of days. Anyone got any suggestions for more targeted diagnostics?
>
> There were several questions embedded in the original post, and I'm not
> sure any of them have really been answered:
>
> o Why is the file flagged by ZFS as fatally corrupted still accessible?
>   [is this new behavior from b111b vs b125?]
>
> o What possible mechanism could there be for the /calculated/ checksums
>   of /four/ copies of just one specific file to be bad and no others?

Broken CPU, HBA, bus, or memory.

> o Why did this only happen at boot to just this one file which also is
>   peculiarly subject to the bit flips you observed, also mostly at boot
>   (sometimes at scrub)? I like the feeble power supply answer, but why
>   just this one file? Bizarre...

Broken CPU, HBA, bus, memory, or power supply.

> # zpool get failmode rpool
> NAME   PROPERTY  VALUE     SOURCE
> rpool  failmode  wait      default
>
> This machine is extremely memory limited, so I suspect that libdlpi.so.1 is
> not in a cache. Certainly, a brand new copy wouldn't be, and there's no
> problem writing and (much later) reading the new copy (or the old one,
> for that matter). It remains to be seen if the brand new copy gets
> clobbered at boot (the machine, for all its faults, remains busily up and
> operational for months at a time). Maybe I should schedule a reboot out of
> curiosity :-).
>
>> This sort of specific error analysis is possible after b125. See
>> CR 6867188 for more details.
>
> Wasn't this in b125?
> IIRC we upgraded to b125 for this very reason. There
> certainly seems to be an overwhelming amount of data in the various logs!
>
> Cheers -- Frank
>
> [1] This could be (3+1) * 4 where in one instance all 3+1 happen to be the
> same. Does ZFS really read all 4 copies 4 times (by fmdump timestamp, 8
> within 1uS, 40mS later, another 8, again within 1uS)? Not sure what the
> fmdump timestamps mean, so it's hard to find any pattern.

Transient failures are some of the most difficult to track down. Not all
transient failures are random.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 03/22/10 11:50 PM, Richard Elling wrote:
> Look again, the checksums are different.

Whoops, you are correct, as usual. Just 6 bits out of 256 different...

Last year:
  expected 4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a
  actual   4a027c11b3ba4cec bf274567d5615b7b 3ef5fe61b2ed672e ec86a5b3fd33094a

Last month (obviously a different file):
  expected 4b454eec8aebddb5 3b74c5235e1963ee c4489bdb2b475e76 fda3474dd1b6b63f
  actual   4b454eec8aebddb5 3b74c5255e1963ee c4489bdb2b475e76 fda354c1d1b6b63f

Look which bits are different - digits 24 and 53-56 in both cases. But
comparing the bits, there's no discernible pattern. Is this an artifact of
the algorithm, made by one erring bit always being at the same offset?

> don't forget the -V flag :-)

I didn't. As mentioned, there are subsequent set-bit errors (14 minutes
later), but none for this particular incident. I'll send you the results
separately, since they are so puzzling. These 16 checksum failures on
libdlpi.so.1 were the only fmdump -eV entries for the entire boot sequence,
except that it started out with one ereport.fs.zfs.data, whatever that is,
for a total of exactly 17 records: 9 in 1 uS, then 8 more 40 mS later, also
within 1 uS. Then nothing for 4 minutes, one more checksum failure
("bad_range_sets =") and then, 10 minutes later, two with the set-bits
error, one for each disk. That's it.

>> o Why is the file flagged by ZFS as fatally corrupted still accessible?

This is the part I was hoping to get answers for, since AFAIK this should
be impossible. Since none of this is having any operational impact, all of
these issues are of interest only, but this is a bit scary!

> Broken CPU, HBA, bus, memory, or power supply.

No argument there. Doesn't leave much, does it :-). Since the file itself
appears to be uncorrupted, and the metadata is consistent for all 16
entries, it would seem that the checksum calculation itself is failing,
because it would appear in this case that everything else is OK.

Is there a way to apply the fletcher2 algorithm interactively, as in sum(1)
or cksum(1) (i.e., outside the scope of ZFS), to see if it is in some way
pattern sensitive with this CPU? Since only a small subset of files is
affected, this should be easy to verify. Start a scrub to heat things up
and then in parallel do checksums in a tight loop...

> Transient failures are some of the most difficult to track down. Not all
> transient failures are random.

Indeed, although this doesn't seem to be random. The hits to libdlpi.so.1
seem to be quite reproducible, as you've seen from the fmdump log, although
I doubt this particular scenario will happen again. Can you think of any
tools to investigate this? I suppose I could extract the checksum code from
ZFS itself to build one, but that would take quite a lot of time.

Is there any documentation that explains the output of fmdump -eV? What are
set-bits, for example? I guess not... from man fmdump(1m):

     The error log file contains /Private/ telemetry information used by
     Sun's automated diagnosis software.
     ......
     Each problem recorded in the fault log is identified by:
     o The time of its diagnosis

So did ZFS really read 8 copies of libdlpi.so.1 within 1 uS, wait 40 mS and
then read another 8 copies in 1 uS again? I doubt it :-). I bet it took
more than 1 uS just to (mis)calculate the checksum (1.6GHz 16 bit cpu).

Thanks -- Frank
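A quick way to see exactly which bits differ, rather than comparing hex digits by eye, is to XOR the corresponding 64-bit words of the expected and actual checksums. A sketch in bash using the second word from each of the two incidents above (bash arithmetic is 64-bit, so the large constants wrap, but the XOR is still bit-exact):

    printf '%016x\n' $(( 0xbf274565d5615b7b ^ 0xbf274567d5615b7b ))   # last year
    printf '%016x\n' $(( 0x3b74c5235e1963ee ^ 0x3b74c5255e1963ee ))   # last month
    # prints 0000000200000000 and 0000000600000000: both differences fall in
    # the same byte of the same word (hex digit 24 of the 64-digit checksum)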
On Tue, Mar 23, 2010 at 07:22:59PM -0400, Frank Middleton wrote:
> On 03/22/10 11:50 PM, Richard Elling wrote:
>
>> Look again, the checksums are different.
>
> Whoops, you are correct, as usual. Just 6 bits out of 256 different...
>
> Look which bits are different - digits 24 and 53-56 in both cases.

This is very likely an error introduced during the calculation of the hash,
rather than an error in the input data. I don't know how that helps narrow
down the source of the problem, though.

It suggests an experiment: try switching to another hash algorithm. It may
move the problem around, or even make it worse, of course.

I'm also reminded of a thread about the implementation of fletcher2 being
flawed; perhaps you're better off switching regardless.

>>> o Why is the file flagged by ZFS as fatally corrupted still accessible?
>
> This is the part I was hoping to get answers for, since AFAIK this should
> be impossible. Since none of this is having any operational impact, all of
> these issues are of interest only, but this is a bit scary!

It's only the blocks with bad checksums that should return errors. Maybe
you're not reading those, or the transient error doesn't happen next time
when you actually try to read it / from the other side of the mirror.

Repeated errors in the same file could also be a symptom of an error
calculating the hash when the file was written. If there's a bit-flipping
issue at the root of it, with some given probability, that would invert the
probabilities of "correct" and "error" results.

-- 
Dan.
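For anyone who wants to try the experiment, the checksum algorithm is an ordinary per-dataset property; note that it only affects blocks written after the change, so existing data keeps its fletcher2 checksums until it is rewritten (the dataset name here is illustrative):

    # switch new writes to sha256 (fletcher4 is another option)
    zfs set checksum=sha256 rpool/ROOT/opensolaris
    zfs get checksum rpool/ROOT/opensolaris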
You could try copying the file to /tmp (i.e. swap/ram) and doing a
continuous loop of checksums, e.g.:

    while [ ! -f libdlpi.so.1.x ]
    do
        sleep 1
        cp libdlpi.so.1 libdlpi.so.1.x
        A="`sha512sum -b libdlpi.so.1.x`"
        [ "$A" == "<hash it should be> *libdlpi.so.1.x" ] && rm libdlpi.so.1.x
    done ; date

The loop keeps going as long as each copy checksums correctly, and stops
(leaving the bad copy behind) as soon as one does not.

Assuming the file never goes to swap, this would tell you if something on
the motherboard is playing up.

I have seen a CPU randomly set a byte to 0 which should not be 0; I think
it was an L1 or L2 cache problem.
-- 
This message posted from opensolaris.org
How about running memtest86+ (http://www.memtest.org/) on the machine for a
while? It doesn't test the arithmetic on the CPU very much, but it stresses
the data paths quite a lot. Just a quick suggestion...
-- 
Saso

Damon Atkins wrote:
> You could try copying the file to /tmp (i.e. swap/ram) and doing a
> continuous loop of checksums, e.g.:
>
> while [ ! -f libdlpi.so.1.x ] ; do sleep 1; cp libdlpi.so.1 libdlpi.so.1.x ;
> A="`sha512sum -b libdlpi.so.1.x`" ;
> [ "$A" == "<hash it should be> *libdlpi.so.1.x" ] && rm libdlpi.so.1.x ;
> done ; date
>
> Assuming the file never goes to swap, this would tell you if something on
> the motherboard is playing up.
>
> I have seen a CPU randomly set a byte to 0 which should not be 0; I think
> it was an L1 or L2 cache problem.
You could also use psradm to take a CPU off-line. At boot I would *assume*
the system boots the same way every time unless something changes, so you
could be hitting the same CPU core every time, or the same bit of RAM,
until the system is fully booted.

Or even run SunVTS (the Validation Test Suite), which I believe has a test
similar to the cp-to-/tmp loop above, plus all the other tests it has.
-- 
This message posted from opensolaris.org
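If the box actually has more than one processor to spare, the off-lining experiment looks roughly like this (going from memory of the psradm(1M) flags, so check the man page before relying on it):

    psrinfo          # list processor IDs and their current state
    psradm -f 1      # take processor 1 off-line
    psradm -n 1      # bring it back on-line afterwards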
On Mar 23, 2010, at 11:21 PM, Daniel Carosone wrote:
> On Tue, Mar 23, 2010 at 07:22:59PM -0400, Frank Middleton wrote:
>> On 03/22/10 11:50 PM, Richard Elling wrote:
>>
>>> Look again, the checksums are different.
>>
>> Whoops, you are correct, as usual. Just 6 bits out of 256 different...
>>
>> Look which bits are different - digits 24 and 53-56 in both cases.
>
> This is very likely an error introduced during the calculation of
> the hash, rather than an error in the input data. I don't know how
> that helps narrow down the source of the problem, though.

The exact same code is used to calculate the checksum when writing or
reading. However, we assume the processor works, and Frank's tests do not
indicate otherwise.

> It suggests an experiment: try switching to another hash algorithm.
> It may move the problem around, or even make it worse, of course.
>
> I'm also reminded of a thread about the implementation of fletcher2
> being flawed; perhaps you're better off switching regardless.

Clearly, fletcher2 identified the problem.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Thanks to everyone who made suggestions! This machine has run memtest for a
week and VTS for several days with no errors. It does seem that the problem
is probably in the CPU cache.

On 03/24/10 10:07 AM, Damon Atkins wrote:
> You could try copying the file to /tmp (i.e. swap/ram) and doing a
> continuous loop of checksums

On a variation of your suggestion, I implemented a bash script that applies
sha1sum 10,000 times, with a pause of 0.1 s between each attempt, and tests
the result against what seemed to be the correct result.

sha1sum on /lib/libdlpi.so.1 produced incorrect results 11% of the time.
sha1sum on /tmp/libdlpi.so.1 produced 5 failures out of 10,000.
sha1sum on /lib/libpam.so.1 produced zero errors in 10,000.
sha1sum on /tmp/libpam.so.1, ditto.

So what we have is a pattern-sensitive failure that is also sensitive to
how busy the cpu is (and doesn't fail running VTS). md5sum and sha256sum
produced similar results, and presumably so would fletcher2. To get really
meaningful results, the machine should be otherwise idle (but then, maybe
it wouldn't fail).

Is anyone willing to speculate (or does anyone have suggestions for further
experiments) about what failure mode could cause a checksum calculation to
be pattern sensitive and also thousands of times more likely to fail if the
file is read from disk vs. tmpfs? FWIW the failures are pretty consistent,
mostly but not always producing the same bad checksum.

So at boot, the cpu is busy, increasing the probability of this
pattern-sensitive failure, and this one time it failed on every read of
/lib/libdlpi.so.1. With copies=1 this was twice as likely to happen, and
when it did, ZFS returned an error on any attempt to read the file. With
copies=2, in this case, it doesn't return an error when attempting to read.
Also, there were no set-bit errors this time, but then I have no idea what
a set-bit error is.

On 03/24/10 12:32 PM, Richard Elling wrote:
> Clearly, fletcher2 identified the problem.

Ironically, on this hardware it seems it created the problem :-). However,
you have been vindicated - it was a pattern-sensitive problem, as you have
long suggested it might be. So: that the file is still readable is a
mystery, but how it came to be flagged as bad in ZFS isn't, any more.

Cheers -- Frank
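The script itself isn't quoted in the post; a minimal sketch of the kind of loop described above might look like the following (the reference hash is a placeholder, and a sleep that accepts fractional seconds, e.g. GNU sleep, is assumed):

    #!/usr/bin/bash
    # Hash the same file 10,000 times and count mismatches against a
    # known-good reference value.
    f=/lib/libdlpi.so.1
    good="PUT-KNOWN-GOOD-SHA1-HERE"     # placeholder, not the real hash
    bad=0
    i=0
    while [ $i -lt 10000 ]; do
        a=`sha1sum "$f" | awk '{print $1}'`
        [ "$a" != "$good" ] && bad=$((bad + 1))
        sleep 0.1
        i=$((i + 1))
    done
    echo "$bad bad checksums out of $i runs on $f"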