Is there any flaw in the process below? The customer asked:
Sun Cluster, with each zpool composed of 1 LUN (yes, they have been
advised to use a redundant configuration instead). They do not export
the pool to another host; instead they use BCV to make a mirror of the
LUN. They then split the mirror and import the LUN/zpool onto a machine
that is not even part of the cluster - the backup server.
Most of the time the import seems to work, but maybe 10-15% of the time
it panics the system with a bad checksum. The customer does this
procedure on 9 LUNs, 2 times a day. They have been doing the same thing
with VxFS/VxVM for some time without any issue.
They were recommended to run a scrub on a regular basis. I have also
provided a list of things to check that have the potential to cause
checksum errors:
1- Exporting LUNs to two different hosts and creating a zpool on them.
I have seen this at one customer site where one host had a UFS file
system on the same LUN that another host was using in its zpool.
2- Accessing a LUN that is under ZFS control by other means (e.g.
dd of=/dev/..emcpower11c) can cause corrupted data.
3- Mistakenly adding the same device under different names to the
zpool. EMC PowerPath and Sun multipathing can have multiple device
names pointing to the same device.
4- Importing a device without exporting it first (see the sketch after
this list).
5- Bad hardware, or storage/controller bugs.
6- ZFS is not cluster-aware, which means one should use clustering
software when sharing a zpool across multiple hosts. A poor man's
cluster is not supported!
7- The LUNs exported to ZFS being of a RAID-5 type. See these URLs
about RAID-5 issues:
http://blogs.sun.com/bonwick/entry/raid_z
http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
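As a rough sketch of item 4 and the scrub recommendation above (the
pool name "sapcrp" is taken from the analysis below; the BCV split
step is only a placeholder, not exact EMC command syntax):

   On the cluster node that owns the pool, make the on-disk state
   consistent before taking the copy:
      # zpool export sapcrp
      ... split the BCV mirror with the EMC tools ...
      # zpool import sapcrp     (bring the pool back on the cluster node)

   On the backup server, import the split copy; never force-import a
   pool that may still be active on another host:
      # zpool import sapcrp

   Periodic scrub to surface latent checksum errors early:
      # zpool scrub sapcrp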
Also consider reading Jeff Bonwick's article on end-to-end data integrity:
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
Amer.
Analysis:
SolarisCAT(vmcore.1/10V)> stat
core file: /cores/dir31/66015657/vmcore.1
user: Cores User (cores:911)
release: 5.10 (64-bit)
version: Generic_127127-11
machine: sun4v
node name: bansai
domain: gov.edmonton.ab.ca
hw_provider: Sun_Microsystems
system type: SUNW,SPARC-Enterprise-T5220 (UltraSPARC-T2)
hostid: 84ac5f08
dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/dsk/c1t0d0s1(62.8G)
time of crash: Sat Jul 19 22:42:55 MDT 2008 (core is 33 days old)
age of system: 1 days 6 hours 4 minutes 47.86 seconds
panic CPU: 56 (64 CPUs, 31.8G memory)
panic string: ZFS: bad checksum (read on <unknown> off 0: zio 300743cec40
  [L0 SPA space map] 1000L/a00P DVA[0]=<0:484a15000:a00>
  DVA[1]=<0:1df9054a00:a00> fletcher4 lzjb BE contiguous
  birth=17010763 fill=1 cksum=8
!zio involved:
SolarisCAT(vmcore.1/10V)> sdump 300743cec40 zio_t io_spa,io_type,io_error
io_spa = 0x300ee32c4c0
io_type = 1 (ZIO_TYPE_READ)
io_error = 0x32 <<<
!zpool that had blocks with checksum errors:
A block read on the file system using ZFS pool "sapcrp" had a checksum
error. The zio involved had io_error 0x32 (errno 50 decimal):
#define EBADE 50 /* invalid exchange */
zio checksum errors (ECKSUM) are reported as the EBADE errno.
src code:
"
/*
 * We'll take the unused errno 'EBADE' (from the Convergent graveyard)
 * to indicate checksum errors.
 */
#define ECKSUM  EBADE  <<
"
Because of ZFS's end-to-end checksumming, the data that was read had
its checksum computed and compared against the stored value; the two
should match if the data is good. Since the checksums differed, ZFS
concluded that the data is corrupted.
If the storage pool had been set up in a ZFS-redundant configuration
(mirroring or raidz), ZFS could have gone to the mirror/parity, read a
good copy and self-corrected (healed) the bad side. Unfortunately the
pool is configured in a non-redundant fashion as far as ZFS is
concerned. With no redundant configuration (mirror or raidz) in use,
the checksum error left no good copy of the data, and that resulted in
the panic. When redundant vdevs are configured, ZFS is able to heal the
data by reading a copy with a good checksum.
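For example, the existing single-LUN pool could be given ZFS-level
redundancy by attaching a second LUN of at least the same size (the
second device name below is purely hypothetical), turning the vdev into
a mirror; once the resilver completes, a block that fails its checksum
on one side can be repaired from the other:

   # zpool attach sapcrp c6t6006048000018772084654574F333445d0s0 c7t0d0s0
   # zpool status sapcrp     (wait for the resilver to finish)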
ZFS pool version 2 added metadata replication (ditto blocks). Thus
multiple vdevs with raidz(2) or mirror are more resilient to these
failures, because the metadata can be replicated across them.
Also, a pool can contain multiple raidz or raidz2 vdevs, with data
striped across the raidz groups. One can create 4 raidz groups out of
16 drives and then stripe across the four groups. Each raidz group can
handle one checksum error or one disk failure, so the 4 groups together
can handle 4 such errors (one per group). Striping also increases I/O
bandwidth.
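A minimal sketch of that 16-drive layout, with hypothetical pool and
disk names:

   # zpool create tank \
       raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
       raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
       raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 \
       raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0

ZFS stripes writes across the four raidz top-level vdevs, and each
group can survive the loss or corruption of one of its disks.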
There is no ZFS-level replication with hardware RAID LUNs (EMC) when
only one vdev is exported to ZFS. It is recommended, if possible, to
create multiple simple hardware LUNs, export them to ZFS, and then
configure ZFS to create raidz groups and stripe across those groups.
With this strategy you get the benefit of the hardware RAID boxes
providing large caches for faster updates, together with ZFS's ability
to heal data on the fly using the multiple vdevs under its control.
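A hedged sketch of that recommendation, assuming eight simple LUNs are
presented to the host (the pool and device names are hypothetical
placeholders):

   # zpool create sappool \
       raidz c6t0d0 c6t1d0 c6t2d0 c6t3d0 \
       raidz c6t4d0 c6t5d0 c6t6d0 c6t7d0

The array cache still accelerates writes to each LUN, while ZFS now has
two redundant top-level vdevs and can repair a block that fails its
checksum instead of panicking.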
> 0x300ee32c4c0::spa
ADDR STATE NAME
00000300ee32c4c0 ACTIVE sapcrp
> 0x300ee32c4c0::spa -v
ADDR STATE NAME
00000300ee32c4c0 ACTIVE sapcrp
ADDR STATE AUX DESCRIPTION
000006005ebcdac0 HEALTHY -
0000030015400fc0 HEALTHY -  /dev/dsk/c6t6006048000018772084654574F333445d0s0
> 0x300ee32c4c0::spa -cv
ADDR STATE NAME
00000300ee32c4c0 ACTIVE sapcrp
(none)
ADDR STATE AUX DESCRIPTION
000006005ebcdac0 HEALTHY -
0000030015400fc0 HEALTHY -  /dev/dsk/c6t6006048000018772084654574F333445d0s0
> 0x300ee32c4c0::spa -e
ADDR STATE NAME
00000300ee32c4c0 ACTIVE sapcrp
ADDR STATE AUX DESCRIPTION
000006005ebcdac0 HEALTHY -
              READ      WRITE    FREE   CLAIM   IOCTL
  OPS            0          0       0       0       0
  BYTES          0          0       0       0       0
  EREAD          0
  EWRITE         0
  ECKSUM         0
0000030015400fc0 HEALTHY -  /dev/dsk/c6t6006048000018772084654574F333445d0s0
              READ      WRITE    FREE   CLAIM   IOCTL
  OPS         0x88       0x18       0       0       0
  BYTES   0x841c00    0x10a00       0       0       0
  EREAD          0
  EWRITE         0
  ECKSUM       0x4
Device: c6t6006048000018772084654574F333445d0s0 ->
  ../../devices/scsi_vhci/ssd@g6006048000018772084654574f333445