Chris Siebenmann
2008-Jun-03 15:51 UTC
[zfs-discuss] Bad results from importing a pool on two machines at once
As part of testing for our planned iSCSI + ZFS NFS server environment, I
wanted to see what would happen if I imported a ZFS pool on two machines
at once (as might happen someday in, for example, a failover scenario
gone horribly wrong). What I expected was something between a pool with
damage and a pool that was unrecoverable. What I appear to have got is a
ZFS pool that panics the system whenever you try to import it.

The panic is a 'bad checksum (read on <unknown> off 0: ... [L0 packed
nvlist]' error from zfs:zfsctl_ops_root (I've put the whole thing at the
end of this message).

I got this without doing very much to the dual-imported pool (a rough
command sketch is appended after the panic trace):

 - import on both systems (-f'ing on one)
 - read a large file a few times on both systems
 - zpool export on one system
 - zpool scrub on the other; system panics
 - zpool import now panics either system

One system was a Solaris 10 U4 server with relatively current patches;
the other was Solaris 10 U5 with current patches. (Both 64-bit x86.)

What appears to be the same issue was reported back in April 2007 on the
mailing list, in the message
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-April/039238.html,
but I don't see any followups. Is this a known and filed bug? Is there
any idea when it might be fixed (or when the fix might appear in Solaris
10)?

I have to say that I'm disappointed with ZFS's behavior here; I don't
expect a filesystem that claims to have all sorts of checksums and to
survive all sorts of disk corruption to *ever* panic because it doesn't
like the data on the disk. That is very definitely not 'surviving disk
corruption', especially since it seems to have happened to someone who
was not doing violence to their ZFS pools the way I was.

	- cks

[The full panic:

Jun  3 11:05:14 sansol2 genunix: [ID 809409 kern.notice] ZFS: bad checksum (read on <unknown> off 0: zio ffffffff8e508340 [L0 packed nvlist] 4000L/600P DVA[0]=<0:a8000c000:600> DVA[1]=<0:1040003000:600> fletcher4 lzjb LE contiguous birth=119286 fill=1 cksum=6e160f6970:632da4719324:3057ff16f69527:10e6e1af42eb9b10): error 50
Jun  3 11:05:14 sansol2 unix: [ID 100000 kern.notice]
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dac0 zfs:zfsctl_ops_root+3003724c ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dad0 zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db00 zfs:zio_wait_for_children+49 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db10 zfs:zio_wait_children_done+15 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db20 zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db60 zfs:zio_vdev_io_assess+84 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db70 zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dbd0 zfs:vdev_mirror_io_done+c1 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dbe0 zfs:zio_vdev_io_done+14 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dc60 genunix:taskq_thread+bc ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dc70 unix:thread_start+8 ()
]
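[For reference, a rough sketch of the command sequence described in the
list above. The pool name 'tank' and the hostA/hostB prompts are
placeholders rather than the real pool and host names; both hosts see
the same iSCSI LUNs backing the pool.

    # host A: ordinary import of the shared pool
    hostA# zpool import tank

    # host B: force the import even though the pool looks active elsewhere
    hostB# zpool import -f tank

    # read a large file a few times on both hosts
    hostA# dd if=/tank/bigfile of=/dev/null bs=1024k
    hostB# dd if=/tank/bigfile of=/dev/null bs=1024k

    # export on host A, then scrub on host B; host B panics here
    hostA# zpool export tank
    hostB# zpool scrub tank

    # afterwards, any attempt to import the pool panics either host
    hostA# zpool import -f tank
]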