Peter Eriksson
2006-Nov-21 14:58 UTC
[zfs-discuss] ZFS goes catatonic when drives go dead?
This is a bit frustrating... If I create a zpool with some disks on a SAN (A3500FC)
on a Sun Ultra 10 running Solaris 10 6/06 with all the latest patches:

[0] kraiklyn:~# zpool status
  pool: galahad
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        galahad     ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t5d1  ONLINE       0     0     0
            c1t5d2  ONLINE       0     0     0
            c1t5d3  ONLINE       0     0     0
            c1t5d4  ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c1t5d5  ONLINE       0     0     0
            c1t5d6  ONLINE       0     0     0
            c1t5d7  ONLINE       0     0     0
            c1t5d8  ONLINE       0     0     0
            c1t5d9  ONLINE       0     0     0

errors: No known data errors

Then I fail a drive to simulate a disk failure (the same thing happens if I actually
pull the disk from the system), and this happens:

[1] kraiklyn:~# drivutil -f 20 c1t5d0

drivutil succeeded!
[0] kraiklyn:~# zpool status
  pool: galahad
 state: ONLINE
 scrub: none requested

It never recovers from this state. I went for some coffee at this point and it was
still sitting there when I came back. If I start another terminal and then try
"format", that too hangs.

I can't even reboot the machine (it just hangs) without having to do
a RETURN ~ Ctrl-B to get out to the OpenBoot prompt.

After a reboot it shows the expected output:

[0] kraiklyn:~# zpool status
  pool: galahad
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        galahad     DEGRADED     0     0     0
          raidz     DEGRADED     0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t5d1  UNAVAIL      0     0     0  corrupted data
            c1t5d2  ONLINE       0     0     0
            c1t5d3  ONLINE       0     0     0
            c1t5d4  ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c1t5d5  ONLINE       0     0     0
            c1t5d6  ONLINE       0     0     0
            c1t5d7  ONLINE       0     0     0
            c1t5d8  ONLINE       0     0     0
            c1t5d9  ONLINE       0     0     0

errors: No known data errors

However, if I stay away from ZFS/zpool and just use the raw devices (or use SVM to
handle things) then things work as expected - the LUN goes away/generates I/O errors
when I fail that disk, and things come back when I "unfail" it. No hangs...

It almost feels like ZFS causes some "lock" in the kernel.
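For what it's worth, a rough sketch of how the stuck processes could be inspected from
another, still-responsive shell, assuming mdb -k can still attach (these are standard
Solaris debugging dcmds; interpreting the resulting stacks is another matter):

    # Stacks of the hung zpool and format processes:
    echo "::pgrep zpool | ::walk thread | ::findstack -v" | mdb -k
    echo "::pgrep format | ::walk thread | ::findstack -v" | mdb -k

    # Or dump every kernel thread and look for zfs/zio functions in the stacks:
    echo "::threadlist -v" | mdb -k > /var/tmp/threads.txt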
Peter Eriksson
2006-Nov-21 16:30 UTC
[zfs-discuss] Re: ZFS goes catatonic when drives go dead?
Heh... Found a workaround: wrap all the real disk devices with SVM metadevices, and
then put those into a ZFS raidz volume. Now... this doesn't *really* feel right,
somehow, but if it works, then it works... :-)

# zpool status
  pool: foobar
 state: ONLINE
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        foobar                ONLINE       0     0     0
          raidz               ONLINE       0     0     0
            /dev/md/dsk/d101  ONLINE       0     0     0
            /dev/md/dsk/d102  ONLINE       0     0     0
            /dev/md/dsk/d103  ONLINE       0     0     0
            /dev/md/dsk/d104  ONLINE       0     0     0
            /dev/md/dsk/d105  ONLINE       0     0     0

# drivutil -f 23 c1t5d0
# cp /var/adm/messages /foobar/
# zpool status
  pool: foobar
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        foobar                DEGRADED     0     0     0
          raidz               DEGRADED     0     0     0
            /dev/md/dsk/d101  ONLINE       0     0     0
            /dev/md/dsk/d102  UNAVAIL      0   141     0  cannot open
            /dev/md/dsk/d103  ONLINE       0     0     0
            /dev/md/dsk/d104  ONLINE       0     0     0
            /dev/md/dsk/d105  ONLINE       0     0     0

errors: No known data errors

# zpool online foobar /dev/md/dsk/d102
Bringing device /dev/md/dsk/d102 online

# zpool status
  pool: foobar
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed with 0 errors on Tue Nov 21 17:29:40 2006
config:

        NAME                  STATE     READ WRITE CKSUM
        foobar                ONLINE       0     0     0
          raidz               ONLINE       0     0     0
            /dev/md/dsk/d101  ONLINE       0     0     0
            /dev/md/dsk/d102  ONLINE       0   141     0
            /dev/md/dsk/d103  ONLINE       0     0     0
            /dev/md/dsk/d104  ONLINE       0     0     0
            /dev/md/dsk/d105  ONLINE       0     0     0

errors: No known data errors
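For completeness, a rough sketch of how the one-to-one metadevices above could have been
built before creating the pool; the underlying slices and the replica slice are
assumptions, and SVM needs its state database replicas in place before metainit will run:

    # State database replicas (slice is hypothetical):
    metadb -a -f -c 3 c0t0d0s7

    # One simple one-slice concat per physical disk:
    metainit d101 1 1 c1t5d0s0
    metainit d102 1 1 c1t5d1s0
    metainit d103 1 1 c1t5d2s0
    metainit d104 1 1 c1t5d3s0
    metainit d105 1 1 c1t5d4s0

    # Then the raidz on top of the metadevices:
    zpool create foobar raidz /dev/md/dsk/d101 /dev/md/dsk/d102 /dev/md/dsk/d103 \
        /dev/md/dsk/d104 /dev/md/dsk/d105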
Richard Elling
2006-Nov-21 18:45 UTC
[zfs-discuss] ZFS goes catatonic when drives go dead?
I think this is in the FAQ; if not, then it should be.

Full integration between ZFS and FMA is not yet available. Until it is,
there are some failure modes which are not handled perfectly well. How the
system reacts to the various failure scenarios during this interim period
really depends on the entire software+firmware+hardware stack.
 -- richard

Peter Eriksson wrote:
> This is a bit frustrating... If I create a zpool with some disks on a SAN (A3500FC)
> on a Sun Ultra 10 running Solaris 10 6/06 with all the latest patches:
> [...]
> It never recovers from this state. I went for some coffee at this point and it was
> still sitting there when I came back. If I start another terminal and then try
> "format", that too hangs.
> [...]
> It almost feels like ZFS causes some "lock" in the kernel.
Peter Eriksson
2006-Nov-22 11:38 UTC
[zfs-discuss] Re: ZFS goes catatonic when drives go dead?
There is nothing in the ZFS FAQ about this. I also fail to see how FMA could make any
difference, since it seems that ZFS is deadlocking somewhere in the kernel when this
happens...

It works if you wrap all the physical devices inside SVM metadevices and use those for
your ZFS zpool instead, i.e.:

  metainit d101 1 1 c1t5d0s0
  metainit d102 1 1 c1t5d1s0
  metainit d103 1 1 c1t5d2s0
  zpool create foo raidz /dev/md/dsk/d101 /dev/md/dsk/d102 /dev/md/dsk/d103

Another, unrelated observation - I've noticed that ZFS often works *faster* if I wrap a
physical partition inside a metadevice and then feed that to zpool instead of using the
raw partition directly. Example: testing ZFS on a spare 40GB partition of the boot ATA
disk in a Sun Ultra 10/440 gives horrible performance numbers. If I wrap that partition
into a simple metadevice and feed it to ZFS, things work much faster. I.e.:

Zpool containing one normal disk partition:

  # /bin/time mkfile 1G 1G
  real     2:46.5
  user        0.4
  sys        24.1
  --> 6MB/s (that was actually the best number I got - the worst was 3:03 minutes)

Zpool containing one SVM metadevice containing the same disk partition:

  # /bin/time mkfile 1G 1G
  real     1:41.6
  user        0.3
  sys        23.3
  --> 10MB/s

(Idle machine in both cases, mkfile rerun a couple of times with the same results. I
removed the 1G file between reruns, of course.)
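For anyone who wants to repeat the comparison, a rough sketch of the two test setups
back to back; the pool, metadevice and slice names here are made up, and clean-up steps
are included so the two runs are comparable:

    SLICE=c0t0d0s4                      # hypothetical spare 40GB slice

    # Raw slice directly under ZFS:
    zpool create testpool $SLICE
    /bin/time mkfile 1G /testpool/1G
    zpool destroy testpool

    # Same slice wrapped in a one-to-one SVM metadevice:
    metainit d110 1 1 $SLICE
    zpool create testpool /dev/md/dsk/d110
    /bin/time mkfile 1G /testpool/1G
    zpool destroy testpool
    metaclear d110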
Richard Elling
2006-Nov-22 17:26 UTC
[zfs-discuss] Re: ZFS goes catatonic when drives go dead?
Peter Eriksson wrote:
> There is nothing in the ZFS FAQ about this. I also fail to see how FMA could make any
> difference, since it seems that ZFS is deadlocking somewhere in the kernel when this
> happens...

Some people don't see a difference between "hung" and "patiently waiting."
There are failure modes where you would patiently wait. With full FMA
integration the system will know that patiently waiting is futile.

> Another, unrelated observation - I've noticed that ZFS often works *faster* if I wrap a
> physical partition inside a metadevice and then feed that to zpool instead of using the
> raw partition directly. Example: testing ZFS on a spare 40GB partition of the boot ATA
> disk in a Sun Ultra 10/440 gives horrible performance numbers. If I wrap that partition
> into a simple metadevice and feed it to ZFS, things work much faster.

More likely this is:

  6421427 netra x1 slagged by NFS over ZFS leading to long spins in the
          ATA driver code

 -- richard
Pawel Jakub Dawidek
2006-Nov-23 11:09 UTC
[zfs-discuss] Re: ZFS goes catatonic when drives go dead?
On Wed, Nov 22, 2006 at 03:38:05AM -0800, Peter Eriksson wrote:
> Another, unrelated observation - I've noticed that ZFS often works *faster* if I wrap a
> physical partition inside a metadevice and then feed that to zpool instead of using the
> raw partition directly. Example: testing ZFS on a spare 40GB partition of the boot ATA
> disk in a Sun Ultra 10/440 gives horrible performance numbers. If I wrap that partition
> into a simple metadevice and feed it to ZFS, things work much faster. I.e.:
>
> Zpool containing one normal disk partition:
>
>   # /bin/time mkfile 1G 1G
>   real     2:46.5
>   user        0.4
>   sys        24.1
>   --> 6MB/s (that was actually the best number I got - the worst was 3:03 minutes)
>
> Zpool containing one SVM metadevice containing the same disk partition:
>
>   # /bin/time mkfile 1G 1G
>   real     1:41.6
>   user        0.3
>   sys        23.3
>   --> 10MB/s
>
> (Idle machine in both cases, mkfile rerun a couple of times with the same results. I
> removed the 1G file between reruns, of course.)

It may be because for raw disks ZFS flushes the write cache (via
DKIOCFLUSHWRITECACHE), which can be an expensive operation and depends
heavily on the disks/controllers used. I doubt it does the same for
metadevices, but I may be wrong.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
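One way to see what the drive itself is set up to do, sketched here on the assumption
that the driver exposes the cache mode pages to format(1M)'s expert mode (the cache
menu is not offered for every disk/HBA combination):

    # format -e
      (select the disk, e.g. c1t5d0, from the list)
    format> cache
    cache> write_cache
    write_cache> display     # reports whether the drive's volatile write cache is enabled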
Pawel Jakub Dawidek
2006-Nov-23 11:19 UTC
[zfs-discuss] Re: ZFS goes catatonic when drives go dead?
On Thu, Nov 23, 2006 at 12:09:09PM +0100, Pawel Jakub Dawidek wrote:
> It may be because for raw disks ZFS flushes the write cache (via
> DKIOCFLUSHWRITECACHE), which can be an expensive operation and depends
> heavily on the disks/controllers used. I doubt it does the same for
> metadevices, but I may be wrong.

Oops, you operate on partitions... I think for partitions ZFS disables
the write cache on the disks... Anyway, I'll leave the answer to someone
more clueful.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
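As a short illustration of the distinction being discussed (pool and device names are
hypothetical, and the behaviour is as commonly described for ZFS of this era): when
handed a whole disk, ZFS puts an EFI label on it and takes charge of the drive's write
cache itself; when handed a slice, it leaves the label and the existing cache setting
alone.

    # Whole disk - ZFS relabels it (EFI) and manages the drive's write cache:
    zpool create tank c1t5d0

    # Slice - ZFS uses only that slice and does not touch the cache setting:
    zpool create tank c1t5d0s0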