Hello ZFS mailing list,
We are using ZFS (over iSCSI) on OpenSolaris build 57.
Today we encountered 2 crashes during a ZFS send/receive operation.
We tried to replicate a snapshot via the built-in zfs send/receive
tools.
When we analyzed the resulting crash dump files we found the crash was
ZFS related.
bash-3.00#  mdb -k unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15  
uppc pcplusmp scsi_vhci ufs md ip hook neti sctp arp usba fctl nca  
lofs zfs random sppp cpc fcip crypto fcp logindmux ptm ipc nfs ]
 > ::status
debugging crash dump vmcore.0 (64-bit) from NAS002
operating system: 5.11 snv_57 (i86pc)
panic message:
ZFS: bad checksum (read on <unknown> off 0: zio ffffffff3017b300 [L0  
ZFS plain file] 20000L/20000P DVA[0]=<0:3b98ed1e800:25800> fletcher2  
uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d
dump content: kernel pages only
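For readers following along, the size and location fields in that panic message can be decoded mechanically. The snippet below is only a sketch: it assumes the `NL/NP` sizes and the DVA offset and allocated size are printed in hexadecimal and the vdev id in decimal, as in the ZFS block pointer debug format. The names `BLKPTR_RE` and `decode_blkptr` are made up for this illustration and are not part of any ZFS tool.

```python
import re

# Hypothetical helper: matches the "<lsize>L/<psize>P DVA[n]=<vdev:offset:asize>"
# part of the panic text. Sizes and offsets are assumed to be hexadecimal.
BLKPTR_RE = re.compile(
    r"(?P<lsize>[0-9a-f]+)L/(?P<psize>[0-9a-f]+)P\s+"
    r"DVA\[(?P<index>\d+)\]=<(?P<vdev>\d+):(?P<offset>[0-9a-f]+):(?P<asize>[0-9a-f]+)>"
)

# The panic message from the crash dump, joined onto one line.
PANIC = (
    "ZFS: bad checksum (read on <unknown> off 0: zio ffffffff3017b300 "
    "[L0 ZFS plain file] 20000L/20000P DVA[0]=<0:3b98ed1e800:25800> fletcher2 "
    "uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d"
)


def decode_blkptr(message):
    """Return the decoded size/DVA fields as integers, or None if not found."""
    m = BLKPTR_RE.search(message)
    if m is None:
        return None
    return {
        "lsize": int(m["lsize"], 16),    # logical block size in bytes
        "psize": int(m["psize"], 16),    # physical (on-disk) size in bytes
        "dva_index": int(m["index"]),    # which DVA copy of the block
        "vdev": int(m["vdev"]),          # top-level vdev id
        "offset": int(m["offset"], 16),  # byte offset within that vdev
        "asize": int(m["asize"], 16),    # allocated size in bytes
    }


if __name__ == "__main__":
    for name, value in decode_blkptr(PANIC).items():
        print(f"{name}: {value}")
```

Under those assumptions the failing block is a 128 KiB level-0 file data block on top-level vdev 0, which in this pool would be the raidz1 group.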
So we ran zpool status -v and found that this snapshot (the one
we were trying to replicate) had 28 permanent errors:
bash-3.00# zpool status -v
   pool: home
  state: ONLINE
  scrub: none requested
config:
         NAME        STATE     READ WRITE CKSUM
         home        ONLINE       0     0     0
           mirror    ONLINE       0     0     0
             c1d0s7  ONLINE       0     0     0
             c2d0s7  ONLINE       0     0     0
errors: No known data errors
   pool: stor
  state: ONLINE
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: none requested
config:
         NAME        STATE     READ WRITE CKSUM
         stor        ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c4t2d0  ONLINE       0     0     0
             c4t3d0  ONLINE       0     0     0
             c4t4d0  ONLINE       0     0     0
             c4t5d0  ONLINE       0     0     0
             c4t6d0  ONLINE       0     0     0
             c4t7d0  ONLINE       0     0     0
             c4t8d0  ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
         stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/C_Root/Documents and Settings/bvp/My Documents/My Pictures/confidential/tconfidential/confidential/96
         stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/C_Root/Documents and Settings/bvp/My Documents/My Pictures//confidential/tconfidential/confidential/97
....
So we decided to destroy this snapshot, and then started another
replication.
This time the server crashed again :-(
What can we do to avoid this kind of problem? Or is this a known bug
in build 57?
Thanks for your reply.
Kristof
Opensolaris Aserver wrote:
> We tried to replicate a snapshot via the built-in send receive zfs tools.
...
> ZFS: bad checksum (read on <unknown> off 0: zio ffffffff3017b300 [L0 ZFS
> plain file] 20000L/20000P DVA[0]=<0:3b98ed1e800:25800> fletcher2
> uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d
...
> errors: Permanent errors have been detected in the following files:
>
> stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/C_Root/Documents
> and Settings/bvp/My Documents/My
> Pictures/confidential/tconfidential/confidential/96
...
> So we decided to destroy this snapshot, and then started another
> replication.
>
> This time the server crashed again :-(

So, some of your data has been lost due to hardware failure, where the
hardware has "silently" corrupted your data. ZFS has detected this. If
you read this data (other than via 'zfs send'), you will get EIO, and
as you note, 'zpool status -v' shows which files are affected.

The 'zfs send' protocol isn't able to tell the other side "this part of
this file is corrupt", so it panics. This is a bug.

The reason you're seeing the panic when 'zfs send'-ing the next snapshot
is that the (corrupt) data is shared between multiple snapshots. You can
work around this by deleting or overwriting the files, then taking and
sending a new snapshot.

--matt
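
To apply that workaround you first need the list of damaged files. As a rough sketch only, the Python below pulls the (dataset, path) pairs out of saved `zpool status -v` output. It assumes each entry sits on one line in the form `dataset@snapshot:/path` and that the dataset name itself never contains `:/`. The function name `affected_files` and the `tank/backup` sample are hypothetical, not taken from the pool above.

```python
def affected_files(status_text):
    """Return (dataset, path) pairs listed under 'Permanent errors' in
    `zpool status -v` output. Entries without a ':/' separator are skipped."""
    results = []
    in_errors = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors: Permanent errors"):
            in_errors = True      # the file list follows this header
            continue
        if stripped.startswith("pool:"):
            in_errors = False     # a new pool section begins
            continue
        if not in_errors or ":/" not in stripped:
            continue
        # Split at the first ':/' so colons inside the snapshot name survive.
        dataset, path = stripped.split(":/", 1)
        results.append((dataset, "/" + path))
    return results


# Hypothetical sample input, shaped like the output shown earlier in the thread.
SAMPLE = """\
  pool: tank
 state: ONLINE
errors: Permanent errors have been detected in the following files:

        tank/backup@2007-01-08_01:01:00:/dir/file one
        tank/backup@2007-01-08_01:01:00:/dir/file two
"""

if __name__ == "__main__":
    for dataset, path in affected_files(SAMPLE):
        print(dataset, path)
```

The resulting paths are relative to the snapshot. Since snapshots are read-only, the files would be deleted or overwritten in the live filesystem before taking and sending a new snapshot.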