Hello ZFS mailing list,

We are using ZFS (over iSCSI) on OpenSolaris build 57. Today we
encountered two crashes during a ZFS send/receive operation. We tried
to replicate a snapshot via the built-in zfs send/receive tools.

When we analyzed the resulting crash dump files, we found the crash
was ZFS related:

bash-3.00# mdb -k unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15 uppc
pcplusmp scsi_vhci ufs md ip hook neti sctp arp usba fctl nca lofs zfs
random sppp cpc fcip crypto fcp logindmux ptm ipc nfs ]
> ::status
debugging crash dump vmcore.0 (64-bit) from NAS002
operating system: 5.11 snv_57 (i86pc)
panic message:
ZFS: bad checksum (read on <unknown> off 0: zio ffffffff3017b300
[L0 ZFS plain file] 20000L/20000P DVA[0]=<0:3b98ed1e800:25800>
fletcher2 uncompressed LE contiguous birth=806063 fill=1
cksum=a487e32d
dump content: kernel pages only

So we ran a 'zpool status -v', and found that this snapshot (the one
we were trying to replicate) had some (28) permanent errors:

bash-3.00# zpool status -v
  pool: home
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        home        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0
            c2d0s7  ONLINE       0     0     0

errors: No known data errors

  pool: stor
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore
        the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        stor        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
            c4t8d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/C_Root/Documents and Settings/bvp/My Documents/My Pictures/confidential/tconfidential/confidential/96
        stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/C_Root/Documents and Settings/bvp/My Documents/My Pictures/confidential/tconfidential/confidential/97
        ....

So we decided to destroy this snapshot, and then started another
replication. This time the server crashed again :-(

What can we do to avoid this kind of problem? Or is this a known bug
in build 57?

Thanks for your reply.

Kristof
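[Editor's note: for reference, a replication like the one described is
typically driven by a pipeline of this shape. The snapshot name below
is the one from the zpool output above; the receiving host (nas003)
and the use of the same pool name on the receiving side are
illustrative assumptions, not details from the report:]

    # full send of one snapshot to a remote pool, preserving the
    # dataset name under the target pool via 'receive -d'
    bash-3.00# zfs send stor/onlinebackup@2007-01-08_01:01:00 | \
        ssh nas003 zfs receive -d stor

[A later incremental replication would use 'zfs send -i <older-snap>
<newer-snap>' piped into the same 'zfs receive'.]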
Opensolaris Aserver wrote:
> We tried to replicate a snapshot via the built-in send receive zfs tools.
...
> ZFS: bad checksum (read on <unknown> off 0: zio ffffffff3017b300 [L0 ZFS
> plain file] 20000L/20000P DVA[0]=<0:3b98ed1e800:25800> fletcher2
> uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d
...
> errors: Permanent errors have been detected in the following files:
>
> stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/C_Root/Documents
> and Settings/bvp/My Documents/My
> Pictures/confidential/tconfidential/confidential/96
...
> So we decided to destroy this snapshot, and then started another
> replication.
>
> This time the server crashed again :-(

So, some of your data has been lost due to hardware failure, where the
hardware has "silently" corrupted your data. ZFS has detected this. If
you were to read this data (other than via 'zfs send'), you would get
EIO, and as you note, 'zpool status -v' shows which files are affected.

The 'zfs send' protocol isn't able to tell the other side "this part
of this file is corrupt", so it panics. This is a bug.

The reason you're seeing the panic when 'zfs send'-ing the next
snapshot is that the (corrupt) data is shared between multiple
snapshots. You can work around this by deleting or overwriting the
affected files, then taking and sending a new snapshot.

--matt
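[Editor's note: a minimal sketch of that workaround, assuming the
dataset is mounted at its default mountpoint /stor/onlinebackup and
using the first affected path from the 'zpool status -v' output above;
the repair snapshot name and receiving host are illustrative:]

    # list the affected files again
    bash-3.00# zpool status -v stor

    # delete (or restore from the original source) each listed file
    bash-3.00# rm "/stor/onlinebackup/1003/kreos11/HB1030/C_Root/Documents and Settings/bvp/My Documents/My Pictures/confidential/tconfidential/confidential/96"

    # take a fresh snapshot and send that one instead
    bash-3.00# zfs snapshot stor/onlinebackup@2007-01-09_repaired
    bash-3.00# zfs send stor/onlinebackup@2007-01-09_repaired | \
        ssh nas003 zfs receive -d stor

[Because the new snapshot no longer references the corrupt blocks, a
full send of it should not hit the panic described above.]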