Chris Siebenmann
2008-Jun-03 15:51 UTC
[zfs-discuss] Bad results from importing a pool on two machines at once
As part of testing for our planned iSCSI + ZFS NFS server environment, I
wanted to see what would happen if I imported a ZFS pool on two machines
at once (as might happen someday in, for example, a failover scenario
gone horribly wrong). What I expected was something between a pool with
damage and a pool that was unrecoverable. What I appear to have got is a
ZFS pool that panics the system whenever you try to import it.

The panic is a 'bad checksum (read on <unknown> off 0: ... [L0 packed
nvlist]' error from zfs:zfsctl_ops_root (I've put the whole thing at the
end of this message).

I got this without doing very much to the dual-imported pool (a rough
command sketch is appended after the panic trace):

 - import on both systems (-f'ing on one)
 - read a large file a few times on both systems
 - zpool export on one system
 - zpool scrub on the other; system panics
 - zpool import now panics either system

One system was a Solaris 10 U4 server with relatively current patches;
the other was Solaris 10 U5 with current patches. (Both 64-bit x86.)

What appears to be the same issue was reported back in April 2007 on the
mailing list, in the message
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-April/039238.html,
but I don't see any followups. Is this a known and filed bug? Is there
any idea when it might be fixed (or when the fix might appear in Solaris
10)?

I have to say that I'm disappointed with ZFS's behavior here; I don't
expect a filesystem that claims to have all sorts of checksums and to
survive all sorts of disk corruption to *ever* panic because it doesn't
like the data on the disk. That is very definitely not 'surviving disk
corruption', especially since it seems to have happened to someone who
was not doing violence to their ZFS pools the way I was.

	- cks

[The full panic:

Jun  3 11:05:14 sansol2 genunix: [ID 809409 kern.notice] ZFS: bad checksum (read on <unknown> off 0: zio ffffffff8e508340 [L0 packed nvlist] 4000L/600P DVA[0]=<0:a8000c000:600> DVA[1]=<0:1040003000:600> fletcher4 lzjb LE contiguous birth=119286 fill=1 cksum=6e160f6970:632da4719324:3057ff16f69527:10e6e1af42eb9b10): error 50
Jun  3 11:05:14 sansol2 unix: [ID 100000 kern.notice]
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dac0 zfs:zfsctl_ops_root+3003724c ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dad0 zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db00 zfs:zio_wait_for_children+49 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db10 zfs:zio_wait_children_done+15 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db20 zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db60 zfs:zio_vdev_io_assess+84 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9db70 zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dbd0 zfs:vdev_mirror_io_done+c1 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dbe0 zfs:zio_vdev_io_done+14 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dc60 genunix:taskq_thread+bc ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fffffe8000f9dc70 unix:thread_start+8 ()
]
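[For reference, a rough sketch of the command sequence described in the
list above. The pool name 'tank' and the hostA/hostB prompts are
placeholders rather than the real pool and host names; both hosts see
the same iSCSI LUNs backing the pool.

    # host A: ordinary import of the shared pool
    hostA# zpool import tank

    # host B: force the import even though the pool looks active elsewhere
    hostB# zpool import -f tank

    # read a large file a few times on both hosts
    hostA# dd if=/tank/bigfile of=/dev/null bs=1024k
    hostB# dd if=/tank/bigfile of=/dev/null bs=1024k

    # export on host A, then scrub on host B; host B panics here
    hostA# zpool export tank
    hostB# zpool scrub tank

    # afterwards, any attempt to import the pool panics either host
    hostA# zpool import -f tank
]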