Hello ZFS mailing list,
We are using ZFS (over iSCSI) on OpenSolaris build 57 (snv_57).
Today we encountered two crashes during a ZFS send/receive operation,
while trying to replicate a snapshot via the built-in zfs send/receive
tools.
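The replication was of roughly this form (the target host and
destination filesystem names here are placeholders, not our real ones):

bash-3.00# zfs send stor/onlinebackup@2007-01-08_01:01:00 | \
           ssh backuphost zfs receive backup/onlinebackup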
When we analyzed the resulting crash dump files, we found that the
crash was ZFS related:
bash-3.00# mdb -k unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15
uppc pcplusmp scsi_vhci ufs md ip hook neti sctp arp usba fctl nca
lofs zfs random sppp cpc fcip crypto fcp logindmux ptm ipc nfs ]
> ::status
debugging crash dump vmcore.0 (64-bit) from NAS002
operating system: 5.11 snv_57 (i86pc)
panic message:
ZFS: bad checksum (read on <unknown> off 0: zio ffffffff3017b300 [L0
ZFS plain file] 20000L/20000P DVA[0]=<0:3b98ed1e800:25800> fletcher2
uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d
dump content: kernel pages only
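(For anyone who wants to dig further into a dump like this, the panic
thread's stack can be pulled at the same mdb prompt with the standard
::stack dcmd:

> ::stack

We have left ours out of this mail.)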
So we ran zpool status -v and found that the snapshot we were trying
to replicate had 28 permanent errors:
bash-3.00# zpool status -v
  pool: home
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        home        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0
            c2d0s7  ONLINE       0     0     0

errors: No known data errors
  pool: stor
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        stor        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
            c4t8d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/
        C_Root/Documents and Settings/bvp/My Documents/My Pictures/
        confidential/tconfidential/confidential/96
        stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/
        C_Root/Documents and Settings/bvp/My Documents/My Pictures/
        confidential/tconfidential/confidential/97
....
So we decided to destroy this snapshot and then started another
replication.
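The destroy step was simply (snapshot name as shown in the zpool
status output above):

bash-3.00# zfs destroy stor/onlinebackup@2007-01-08_01:01:00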
This time the server crashed again :-(
What can we do to avoid this kind of problem? Or is this a known bug
in build 57?
Thanks for your reply.
Kristof
Opensolaris Aserver wrote:
> We were trying to replicate a snapshot via the built-in zfs
> send/receive tools.
...
> ZFS: bad checksum (read on <unknown> off 0: zio ffffffff3017b300 [L0
> ZFS plain file] 20000L/20000P DVA[0]=<0:3b98ed1e800:25800> fletcher2
> uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d
...
> errors: Permanent errors have been detected in the following files:
>
>         stor/onlinebackup@2007-01-08_01:01:00:/1003/kreos11/HB1030/
>         C_Root/Documents and Settings/bvp/My Documents/My Pictures/
>         confidential/tconfidential/confidential/96
...
> So we decided to destroy this snapshot and then started another
> replication.
>
> This time the server crashed again :-(

So, some of your data has been lost due to hardware failure, where the
hardware has "silently" corrupted your data. ZFS has detected this.

If you were to read this data (other than via 'zfs send'), you would
get EIO, and as you note, 'zpool status -v' shows which files are
affected.

The 'zfs send' protocol isn't able to tell the other side "this part
of this file is corrupt", so it panics. This is a bug.

The reason you're seeing the panic when 'zfs send'-ing the next
snapshot is that the (corrupt) data is shared between multiple
snapshots. You can work around this by deleting or overwriting the
affected files, then taking and sending a new snapshot.
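In command form, assuming stor/onlinebackup is mounted at its default
mountpoint (everything in angle brackets is a placeholder):

# rm '/stor/onlinebackup/<each file listed by zpool status -v>'
# zfs snapshot stor/onlinebackup@<newsnap>
# zfs send stor/onlinebackup@<newsnap> | ssh <host> zfs receive <target-fs>

--matt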