Last Friday, one of our V880s kernel panicked with the following
message. This is a SAN-connected ZFS pool attached to one LUN. From
this, it appears that the SAN 'disappeared' and then there was a panic
shortly after.

Am I reading this correctly?

Is this normal behavior for ZFS?

This is a mostly patched Solaris 10 6/06 install. Before patching this
system we did have a couple of NFS-related panics, always on Fridays!
This is the fourth panic, and the first time with a ZFS error. There
are no errors in zpool status.

Dec  1 20:30:21 foobar scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/fibre-channel@1/sd@1,1 (sd17):
Dec  1 20:30:21 foobar  SCSI transport failed: reason 'incomplete': retrying command
Dec  1 20:30:21 foobar scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/fibre-channel@1/sd@1,1 (sd17):
Dec  1 20:30:21 foobar  SCSI transport failed: reason 'incomplete': retrying command
Dec  1 20:30:21 foobar scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/fibre-channel@1/sd@1,1 (sd17):
Dec  1 20:30:21 foobar  disk not responding to selection
Dec  1 20:30:21 foobar scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/fibre-channel@1/sd@1,1 (sd17):
Dec  1 20:30:21 foobar  disk not responding to selection
Dec  1 20:30:21 foobar scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/fibre-channel@1/sd@1,1 (sd17):
Dec  1 20:30:21 foobar  disk not responding to selection
Dec  1 20:30:21 foobar scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/fibre-channel@1/sd@1,1 (sd17):
Dec  1 20:30:21 foobar  disk not responding to selection
Dec  1 20:30:22 foobar scsi: [ID 107833 kern.warning] WARNING: /pci@9,600000/fibre-channel@1/sd@1,1 (sd17):
Dec  1 20:30:22 foobar  disk not responding to selection
Dec  1 20:30:22 foobar unix: [ID 836849 kern.notice]
Dec  1 20:30:22 foobar panic[cpu2]/thread=2a100aedcc0:
Dec  1 20:30:22 foobar unix: [ID 809409 kern.notice] ZFS: I/O failure (write on <unknown> off 0: zio 3004c0ce540 [L0 unallocated] 20000L/20000P DVA[0]=<0:2ae1900000:20000> fletcher2 uncompressed BE contiguous birth=586818 fill=0 cksum=102297a2db39dfc:cc8e38087da7a38f:239520856ececf15:c2fd369cea9db4a1): error 5
Dec  1 20:30:22 foobar unix: [ID 100000 kern.notice]
Dec  1 20:30:22 foobar genunix: [ID 723222 kern.notice] 000002a100aed740 zfs:zio_done+284 (3004c0ce540, 0, a8, 70513bf0, 0, 60001374940)
Dec  1 20:30:22 foobar genunix: [ID 179002 kern.notice]   %l0-3: 000003006319fc80 0000000070513800 0000000000000005 0000000000000005
Dec  1 20:30:22 foobar   %l4-7: 000000007b224278 0000000000000002 000000000008f442 0000000000000005
Dec  1 20:30:22 foobar genunix: [ID 723222 kern.notice] 000002a100aed940 zfs:zio_vdev_io_assess+178 (3004c0ce540, 8000, 10, 0, 0, 10)
Dec  1 20:30:22 foobar genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000002 0000000000000001 0000000000000000 0000000000000005
Dec  1 20:30:22 foobar   %l4-7: 0000000000000010 0000000035a536bc 0000000000000000 00043d7293172cfc
Dec  1 20:30:22 foobar genunix: [ID 723222 kern.notice] 000002a100aeda00 genunix:taskq_thread+1a4 (600012a0c38, 600012a0be0, 50001, 43d72c8bfb810, 2a100aedaca, 2a100aedac8)
Dec  1 20:30:22 foobar genunix: [ID 179002 kern.notice]   %l0-3: 0000000000010000 00000600012a0c08 00000600012a0c10 00000600012a0c12
Dec  1 20:30:22 foobar   %l4-7: 0000030060946320 0000000000000002 0000000000000000 00000600012a0c00
Dec  1 20:30:22 foobar unix: [ID 100000 kern.notice]
Dec  1 20:30:22 foobar genunix: [ID 672855 kern.notice] syncing file systems...
Douglas Denny wrote:
> Last Friday, one of our V880s kernel panicked with the following
> message. This is a SAN-connected ZFS pool attached to one LUN. From
> this, it appears that the SAN 'disappeared' and then there was a
> panic shortly after.
>
> Am I reading this correctly?

Yes.

> Is this normal behavior for ZFS?

Yes. You have no redundancy (from ZFS' point of view at least), so ZFS
has no option except panicking in order to maintain the integrity of
your data.

> This is a mostly patched Solaris 10 6/06 install. Before patching
> this system we did have a couple of NFS-related panics, always on
> Fridays! This is the fourth panic, and the first time with a ZFS
> error. There are no errors in zpool status.

Without data, it is difficult to suggest what might have caused your
NFS panics.

James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
On 12/4/06, James C. McPherson <James.C.McPherson@gmail.com> wrote:
>> Is this normal behavior for ZFS?
>
> Yes. You have no redundancy (from ZFS' point of view at least),
> so ZFS has no option except panicking in order to maintain the
> integrity of your data.

This is interesting from an implementation point of view. Any singly
attached SAN connection that has a disconnect from its switch/back end
will cause ZFS to panic; why would it not wait and see if the device
came back? Should all SAN-connected ZFS pools have redundancy built
in, with dual HBAs to dual SAN switches/controllers?

-Doug
Douglas Denny wrote:
> On 12/4/06, James C. McPherson <James.C.McPherson@gmail.com> wrote:
>>> Is this normal behavior for ZFS?
>>
>> Yes. You have no redundancy (from ZFS' point of view at least),
>> so ZFS has no option except panicking in order to maintain the
>> integrity of your data.
>
> This is interesting from an implementation point of view. Any singly
> attached SAN connection that has a disconnect from its switch/back
> end will cause ZFS to panic; why would it not wait and see if the
> device came back? Should all SAN-connected ZFS pools have redundancy
> built in, with dual HBAs to dual SAN switches/controllers?

If you look into your /var/adm/messages file, you should see more than
a few seconds' worth of IO retries, indicating that there was a delay
before panicking while waiting for the device to return.

Answering your second question: all ZFS pools should be configured
with redundancy from ZFS' point of view.

James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
On 12/4/06, James C. McPherson <James.C.McPherson@gmail.com> wrote:
> If you look into your /var/adm/messages file, you should see
> more than a few seconds' worth of IO retries, indicating that
> there was a delay before panicking while waiting for the device
> to return.

My original post contains all the warnings. The first error happened
at 20:30:21 and the system panicked at 20:30:22. It makes me wonder if
there is something else going on here.

> Answering your second question: all ZFS pools should be configured
> with redundancy from ZFS' point of view.

I am sure this is the right answer, but it is not obvious to me how I
would do this the way I do with UFS file systems, using the SAN as the
redundant backing store. Thanks for the feedback.

-Doug
Douglas Denny wrote:
> On 12/4/06, James C. McPherson <James.C.McPherson@gmail.com> wrote:
>> If you look into your /var/adm/messages file, you should see
>> more than a few seconds' worth of IO retries, indicating that
>> there was a delay before panicking while waiting for the device
>> to return.
>
> My original post contains all the warnings. The first error happened
> at 20:30:21 and the system panicked at 20:30:22. It makes me wonder
> if there is something else going on here.

That's surprising. My experience of non-redundant pools (root pools,
no less :>) is that there would be several minutes of retries, when
all the sd and lower layers' retries were added up.

>> Answering your second question: all ZFS pools should be configured
>> with redundancy from ZFS' point of view.
>
> I am sure this is the right answer, but it is not obvious to me how I
> would do this the way I do with UFS file systems, using the SAN as
> the redundant backing store. Thanks for the feedback.

Create two LUNs on your SAN, zone them so your host can see them, then:

  zpool create poolname mirror vdev1 vdev2
  zfs create poolname/fsname

For my Ultra 20, I have / + /usr + /var and some of /opt mirrored
using SVM, and then I have an uber-pool to contain everything else:

$ zdb -C sink
    version=3
    name='sink'
    state=0
    txg=4
    pool_guid=6548940762722570489
    vdev_tree
        type='root'
        id=0
        guid=6548940762722570489
        children[0]
                type='mirror'
                id=0
                guid=5106440632267737007
                metaslab_array=13
                metaslab_shift=31
                ashift=9
                asize=307077840896
                children[0]
                        type='disk'
                        id=0
                        guid=9432259574297221550
                        path='/dev/dsk/c1d0s3'
                        devid='id1,cmdk@AST3320620AS=____________4QF01RZE/d'
                        whole_disk=0
                children[1]
                        type='disk'
                        id=1
                        guid=7176220706626775710
                        path='/dev/dsk/c2d0s3'
                        devid='id1,cmdk@AST3320620AS=____________3QF0EAFP/d'
                        whole_disk=0

which I created by first slicing the disks, then running

  # zpool create sink mirror c1d0s3 c2d0s3

Under the "sink" zpool, I have a few zfs:

$ zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
sink                     119G   161G  24.5K  /sink
sink/hole                574M   161G   574M  /opt/csw
sink/home               2.22G   161G  2.22G  /export/home
sink/scratch            96.1G   161G  96.1G  /scratch
sink/src                6.66G   161G  6.66G  /opt/gate
sink/swim                555M   161G   555M  /opt/local
sink/zones              12.9G   161G  27.5K  /zones
sink/zones/kitchensink  10.6G   161G  10.6G  /zones/kitchensink

which I created with

  # zfs create sink/hole
  # zfs create sink/home

etc. etc.

James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
If you take a look at these messages, the somewhat unusual condition
that may lead to unexpected behaviour (i.e. the fast give-up) is that,
whilst this is a SAN connection, it is achieved through a
non-Leadville config; note the fibre-channel and sd references. In a
Leadville-compliant installation this would be the ssd driver, hence
you'd have to investigate the specific semantics and driver tweaks
that this system has applied to sd in this instance. Maybe the sd
retries have been `tuned` down...?

More info (i.e. an explorer output) would be useful before we jump to
any incorrect conclusions.

Craig

On 4 Dec 2006, at 14:47, Douglas Denny wrote:

> Last Friday, one of our V880s kernel panicked with the following
> message. This is a SAN-connected ZFS pool attached to one LUN. From
> this, it appears that the SAN 'disappeared' and then there was a
> panic shortly after.
>
> Am I reading this correctly?
>
> Is this normal behavior for ZFS?
>
> [quoted panic log trimmed; see the original post above]
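Craig's "tuned down" hypothesis can be checked directly. A minimal
sketch, assuming the stock Solaris 10 sd driver; the tunable names
below should be verified against your build (the mdb read will simply
fail if a symbol does not exist there):

  # Inspect the sd retry/timeout tunables on the running kernel
  echo "sd_retry_count/D" | mdb -k    # per-command retry count (default 5)
  echo "sd_io_time/D"     | mdb -k    # seconds allowed per I/O (default 60)

  # If they have been lowered, they can be pinned back in /etc/system
  # (takes effect after a reboot):
  #   set sd:sd_retry_count = 5
  #   set sd:sd_io_time = 60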
Douglas Denny wrote:
> On 12/4/06, James C. McPherson <James.C.McPherson@gmail.com> wrote:
>>> Is this normal behavior for ZFS?
>>
>> Yes. You have no redundancy (from ZFS' point of view at least),
>> so ZFS has no option except panicking in order to maintain the
>> integrity of your data.
>
> This is interesting from an implementation point of view. Any singly
> attached SAN connection that has a disconnect from its switch/back
> end will cause ZFS to panic; why would it not wait and see if the
> device came back? Should all SAN-connected ZFS pools have redundancy
> built in, with dual HBAs to dual SAN switches/controllers?

UFS will panic on EIO also. Most other file systems will, too. You can
put UFS on top of SVM, but unless SVM is configured for redundancy, it
(UFS) would still panic in such situations. ZFS doesn't bring anything
new here, but I sense a change in expectations that I can't quite
reconcile.
 -- richard
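For comparison, the "UFS on top of SVM configured for redundancy"
setup Richard mentions looks roughly like this. A sketch only; the
slice names (c2t0d0s0, c3t0d0s0, and the s7 state-database slices) are
placeholders for two LUNs on separate paths:

  # State database replicas, then a two-way SVM mirror under UFS
  metadb -a -f c2t0d0s7 c3t0d0s7
  metainit d10 1 1 c2t0d0s0
  metainit d20 1 1 c3t0d0s0
  metainit d0 -m d10
  metattach d0 d20
  newfs /dev/md/rdsk/d0
  mount /dev/md/dsk/d0 /export/data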
Hi all,

Having experienced this, it would be nice if there were an option to
take the filesystem offline instead of kernel panicking, on a
per-zpool basis. If it's a system-critical partition like a database,
I'd prefer it to kernel panic and thereby trigger a fail-over of the
application. However, if it's a zpool hosting some file shares, I'd
prefer it to stay online. Putting that level of control in would
alleviate a lot of the complaints, it seems to me... or at least give
less of a leg to stand on. ;-)

A nasty little notice that tells you the system will kernel panic if a
vdev becomes unavailable wouldn't be bad either when you're creating a
striped zpool. Even the best of us forgets these things.

Best Regards,
Jason

On 12/4/06, Richard Elling <Richard.Elling@sun.com> wrote:
> Douglas Denny wrote:
> [...]
>
> UFS will panic on EIO also. Most other file systems will, too.
> You can put UFS on top of SVM, but unless SVM is configured for
> redundancy, it (UFS) would still panic in such situations. ZFS
> doesn't bring anything new here, but I sense a change in
> expectations that I can't quite reconcile.
> -- richard
Jason J. W. Williams wrote:
> Hi all,
>
> Having experienced this, it would be nice if there were an option to
> take the filesystem offline instead of kernel panicking, on a
> per-zpool basis. If it's a system-critical partition like a
> database, I'd prefer it to kernel panic and thereby trigger a
> fail-over of the application. However, if it's a zpool hosting some
> file shares, I'd prefer it to stay online. Putting that level of
> control in would alleviate a lot of the complaints, it seems to
> me... or at least give less of a leg to stand on. ;-)

Agreed, and we are working on this.

--matt
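The control Matt refers to eventually shipped as the pool-level
"failmode" property in later OpenSolaris/Solaris releases. A minimal
sketch of how it is used, with 'tank' as a placeholder pool name:

  # failmode governs behavior on catastrophic pool failure:
  #   wait     - block I/O until the device returns (the later default)
  #   continue - return EIO to new synchronous writes, keep serving reads
  #   panic    - the behavior discussed in this thread
  zpool set failmode=continue tank
  zpool get failmode tank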
> If you take a look at these messages, the somewhat unusual condition
> that may lead to unexpected behaviour (i.e. the fast give-up) is
> that, whilst this is a SAN connection, it is achieved through a
> non-Leadville config; note the fibre-channel and sd references. In a
> Leadville-compliant installation this would be the ssd driver, hence
> you'd have to investigate the specific semantics and driver tweaks
> that this system has applied to sd in this instance.

If only it were possible to use the Leadville drivers... We've seen
the same problems here (an *instant* panic from ZFS if the FC switch
reboots). I wouldn't mind if it kept on retrying a tad bit longer,
preferably configurably. And to panic? How can that in any sane way be
a good way to "protect" the application? *BANG* - no chance at all for
the application to handle the problem...

Btw, in our case we have also wrapped the raw FC-attached "disks" with
SVM metadevices first, because if a disk in an A3500FC unit goes bad
then we hit the _other_ failure mode of ZFS - a total hang - until I
noticed that wrapping the device with a layer of SVM metadevices
insulated ZFS from that problem. Now it correctly notices that the
disk is "gone/dead" and displays that when doing "zpool status" etc.

We (Lysator ACS - a students' computer club) can't use the Leadville
driver, since the 'ifp' driver (and hence use of the 'ssd' disks) for
the Qlogic QLA2100 HBA boards is based on an older Qlogic firmware
that only supports a maximum of 16 LUNs per target, and we want
more... So we use the Qlogic qla2100 driver instead, which works
really nicely, but then it uses the 'sd' disk devices instead. Being a
computer club with limited funds means one finds ways to use old
hardware in new and interesting ways :-)

Hardware in use: Primary file server: Sun Ultra 450, two Qlogic
QLA2100 HBAs. One connected via an 8-port FC-AL *hub* to two Sun A5000
JBOD boxes (filled with 9 and 18 GB FC disks), the other via a Brocade
2400 8-port switch (running in "QuickLoop" mode) to a Compaq
StorageWorks RA8000 RAID and two A3500FC systems.

Now... What can *possibly* go wrong with that setup? :-)

I'll tell you a couple:

1. When the server entered multiuser and started serving NFS to all
the users' $HOME - many, many disks in the A5000 started resetting
themselves again and again and again... Solution: tune down the
maximum number of tagged commands sent to the disks in
/kernel/drv/qla2100.conf:

     hba1-max-iocb-allocation=7;   # was 256
     hba1-execution-throttle=7;    # was 31

   (This problem wasn't there with the old Sun 'ifp' driver, probably
   because it has less aggressive limits - but since that driver is
   totally nonconfigurable, it's impossible to tell.)

2. The power cord got slightly loose on the Brocade switch, causing it
to reboot, sending the server into an *instant PANIC, thanks to ZFS*.
Any chance we might get a short refresher warning when creating a
striped zpool? O:-)

Best Regards,
Jason

On 12/4/06, Matthew Ahrens <Matthew.Ahrens@sun.com> wrote:
> Jason J. W. Williams wrote:
> [...]
>
> Agreed, and we are working on this.
>
> --matt
Peter Eriksson wrote:
>> If you take a look at these messages, the somewhat unusual
>> condition that may lead to unexpected behaviour (i.e. the fast
>> give-up) is that, whilst this is a SAN connection, it is achieved
>> through a non-Leadville config; note the fibre-channel and sd
>> references. In a Leadville-compliant installation this would be the
>> ssd driver, hence you'd have to investigate the specific semantics
>> and driver tweaks that this system has applied to sd in this
>> instance.
>
> If only it were possible to use the Leadville drivers... We've seen
> the same problems here (an *instant* panic from ZFS if the FC switch
> reboots). I wouldn't mind if it kept on retrying a tad bit longer,
> preferably configurably. And to panic? How can that in any sane way
> be a good way to "protect" the application? *BANG* - no chance at
> all for the application to handle the problem...

The *application* should not be worrying about handling error
conditions in the kernel. That's the kernel's job, and in this case,
ZFS' job. ZFS protects *your data* by preventing any more writes from
occurring when it cannot guarantee the integrity of your data.

> Btw, in our case we have also wrapped the raw FC-attached "disks"
> with SVM metadevices first, because if a disk in an A3500FC unit
> goes bad then we hit the _other_ failure mode of ZFS - a total hang
> - until I noticed that wrapping the device with a layer of SVM
> metadevices insulated ZFS from that problem. Now it correctly
> notices that the disk is "gone/dead" and displays that when doing
> "zpool status" etc.

Hm. An extra layer of complexity. Kinda defeats one of the stated
goals of ZFS.

> We (Lysator ACS - a students' computer club) can't use the Leadville
> driver, since the 'ifp' driver (and hence use of the 'ssd' disks)
> for the Qlogic QLA2100 HBA boards is based on an older Qlogic
> firmware that only supports a maximum of 16 LUNs per target, and we
> want more... So we use the Qlogic qla2100 driver instead, which
> works really nicely, but then it uses the 'sd' disk devices instead.
> Being a computer club with limited funds means one finds ways to use
> old hardware in new and interesting ways :-)

Ebay.se?

> Hardware in use: Primary file server: Sun Ultra 450, two Qlogic
> QLA2100 HBAs. One connected via an 8-port FC-AL *hub* to two Sun
> A5000 JBOD boxes (filled with 9 and 18 GB FC disks), the other via a
> Brocade 2400 8-port switch (running in "QuickLoop" mode) to a Compaq
> StorageWorks RA8000 RAID and two A3500FC systems.
>
> Now... What can *possibly* go wrong with that setup? :-)

Hmmm.... let's start with the mere existence of the EOL'd A3500FC
hardware in your config. Kinda goes downhill from there :)

> I'll tell you a couple:
>
> 1. When the server entered multiuser and started serving NFS to all
> the users' $HOME - many, many disks in the A5000 started resetting
> themselves again and again and again... Solution: tune down the
> maximum number of tagged commands sent to the disks in
> /kernel/drv/qla2100.conf:
>
>      hba1-max-iocb-allocation=7;   # was 256
>      hba1-execution-throttle=7;    # was 31
>
>    (This problem wasn't there with the old Sun 'ifp' driver,
>    probably because it has less aggressive limits - but since that
>    driver is totally nonconfigurable, it's impossible to tell.)

Ebay.se

> 2. The power cord got slightly loose on the Brocade switch, causing
> it to reboot, sending the server into an *instant PANIC, thanks to
> ZFS*.

Yes, as noted, this is by design in order to *protect your data*.

James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
Matthew Ahrens wrote:
> Jason J. W. Williams wrote:
>> Having experienced this, it would be nice if there were an option
>> to take the filesystem offline instead of kernel panicking, on a
>> per-zpool basis. [...]
>
> Agreed, and we are working on this.

Similar to UFS's onerror mount option, I take it?

/dale
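For readers unfamiliar with it: UFS's onerror mount option takes the
values panic (the default), lock, and umount, controlling what UFS
does when it detects an internal inconsistency. A quick sketch, with a
hypothetical device and mount point:

  # Unmount the filesystem on error instead of panicking the machine
  mount -F ufs -o onerror=umount /dev/dsk/c0t0d0s6 /export/data

  # Or persistently, in /etc/vfstab:
  # /dev/dsk/c0t0d0s6  /dev/rdsk/c0t0d0s6  /export/data  ufs  2  yes  onerror=umount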
> And to panic? How can that in any sane way be a good
> way to "protect" the application?
> *BANG* - no chance at all for the application to
> handle the problem...

I agree -- a disk error should never be fatal to the system; at worst,
the file system should appear to have been forcibly unmounted (and
"worst" really means that critical metadata, like the
superblock/uberblock, can't be updated on any of the disks in the
pool). That at least gives other applications which aren't using the
file system the chance to keep going.

An I/O error detected when writing a file can be reported at write()
time, fsync() time, or close() time. Any application which doesn't
check all three of these won't handle all I/O errors properly; and
applications which care about knowing that their data is on disk must
either use synchronous writes (O_SYNC/O_DSYNC) or call fsync before
closing the file. ZFS should report back these errors in all cases and
avoid panicking (obviously).

That said, it also appears that the device drivers (either the
FibreChannel or SCSI disk drivers in this case) are misbehaving. The
FC driver appears to be reporting back an error which is interpreted
as fatal by the SCSI disk driver, when one or the other should be
retrying the I/O. (It also appears that either the FC driver, the SCSI
disk driver, or ZFS is misbehaving in the observed hang.)

So ZFS should be more resilient against write errors, and the SCSI
disk or FC drivers should be more resilient against LIPs (the most
likely cause of your problem) or other transient errors.
(Alternatively, the ifp driver could be updated to support the maximum
number of targets on a loop, which might also solve your second
problem.)
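Anton's three reporting points are easy to get wrong in application
code. A minimal C sketch checking all of them; the function name and
error handling are illustrative only:

  #include <sys/types.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Returns 0 on success, -1 if the data may not have reached disk. */
  int write_all_checked(const char *path, const char *buf, size_t len)
  {
      int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd == -1)
          return (-1);

      /* 1. write() may report EIO immediately... */
      if (write(fd, buf, len) != (ssize_t)len) {
          (void) close(fd);
          return (-1);
      }

      /* 2. ...or the error may only surface when cached dirty pages
       *    are flushed at fsync() time. */
      if (fsync(fd) == -1) {
          (void) close(fd);
          return (-1);
      }

      /* 3. close() is the last chance to learn of a deferred failure. */
      if (close(fd) == -1)
          return (-1);

      return (0);
  }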
Anton B. Rang wrote:
>> Peter Eriksson wrote:
>>> And to panic? How can that in any sane way be a good way to
>>> "protect" the application? *BANG* - no chance at all for the
>>> application to handle the problem...
>
> I agree -- a disk error should never be fatal to the system; at
> worst, the file system should appear to have been forcibly unmounted
> (and "worst" really means that critical metadata, like the
> superblock/uberblock, can't be updated on any of the disks in the
> pool). That at least gives other applications which aren't using the
> file system the chance to keep going.

But it's still not the application's problem to handle the underlying
device failure.

[...]

> That said, it also appears that the device drivers (either the
> FibreChannel or SCSI disk drivers in this case) are misbehaving. The
> FC driver appears to be reporting back an error which is interpreted
> as fatal by the SCSI disk driver, when one or the other should be
> retrying the I/O. (It also appears that either the FC driver, the
> SCSI disk driver, or ZFS is misbehaving in the observed hang.)

In this case it is most likely the qla2x00 driver which is at fault.
The Leadville drivers do the appropriate retries. The sd driver and
ZFS also do the appropriate retries.

> So ZFS should be more resilient against write errors, and the SCSI
> disk or FC drivers should be more resilient against LIPs (the most
> likely cause of your problem) or other transient errors.
> (Alternatively, the ifp driver could be updated to support the
> maximum number of targets on a loop, which might also solve your
> second problem.)

Your alternative option isn't going to happen. The ifp driver and the
card it supports have both long since been EOL'd.

James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
Anton B. Rang wrote:
>> And to panic? How can that in any sane way be a good
>> way to "protect" the application?
>> *BANG* - no chance at all for the application to
>> handle the problem...
>
> I agree -- a disk error should never be fatal to the system; at
> worst, the file system should appear to have been forcibly unmounted
> (and "worst" really means that critical metadata, like the
> superblock/uberblock, can't be updated on any of the disks in the
> pool). That at least gives other applications which aren't using the
> file system the chance to keep going.

This is not always the desired behavior. In particular, for a high
availability cluster, if one node is having difficulty and another is
not, then we'd really like to have the services relocated to the good
node ASAP. I think this case is different, though...

> An I/O error detected when writing a file can be reported at write()
> time, fsync() time, or close() time. Any application which doesn't
> check all three of these won't handle all I/O errors properly; and
> applications which care about knowing that their data is on disk
> must either use synchronous writes (O_SYNC/O_DSYNC) or call fsync
> before closing the file. ZFS should report back these errors in all
> cases and avoid panicking (obviously).

From what I recall of previous discussions on this topic (search the
archives), the difficulty is attributing a failure temporally, given
that you want a file system to have better performance by caching.

> That said, it also appears that the device drivers (either the
> FibreChannel or SCSI disk drivers in this case) are misbehaving. The
> FC driver appears to be reporting back an error which is interpreted
> as fatal by the SCSI disk driver, when one or the other should be
> retrying the I/O. (It also appears that either the FC driver, the
> SCSI disk driver, or ZFS is misbehaving in the observed hang.)

Agree 110%. When debugging layered software/firmware, it is essential
to understand all of the assumptions made at all interfaces.
Currently, ZFS assumes that a fatal write error is in fact fatal.

> So ZFS should be more resilient against write errors, and the SCSI
> disk or FC drivers should be more resilient against LIPs (the most
> likely cause of your problem) or other transient errors.
> (Alternatively, the ifp driver could be updated to support the
> maximum number of targets on a loop, which might also solve your
> second problem.)

NB: LIPs are a normal part of everyday life for fibre channel; they
are not an error. But I think Anton is right here: the way that the
driver deals with incurred exceptions is key to the upper layers being
stable. This can be tuned, but remember that tuning may lead to
instability. We might be dealing with an instability case here, not a
functional spec problem.
 -- richard
Dale Ghent wrote:
> Matthew Ahrens wrote:
>> Jason J. W. Williams wrote:
>> [...]
>>
>> Agreed, and we are working on this.
>
> Similar to UFS's onerror mount option, I take it?

Actually, it would be interesting to see how many customers change the
onerror setting. We have some data, just need more days in the hour.
 -- richard
Richard Elling wrote:
> Actually, it would be interesting to see how many customers change
> the onerror setting. We have some data, just need more days in the
> hour.

I'm pretty sure you'd find that info in over 6 years of submitted
Explorer output :) I imagine that stuff is sandboxed away in a far-off
department, though.

/dale
> But it's still not the application's problem to handle the
> underlying device failure.

But it is the application's problem to handle an error writing to the
file system -- that's why the file system is allowed to return errors.
;-)

Some applications might not check them, some applications might not
have anything reasonable to do (though they can usually at least
output a useful message to stderr), but other applications may be more
robust. It's not particularly uncommon for an application to encounter
an error writing to volume X and then choose to write to volume Y
instead, or to report the error back to another component or the end
user.
>> So ZFS should be more resilient against write errors, and the SCSI
>> disk or FC drivers should be more resilient against LIPs (the most
>> likely cause of your problem) or other transient errors.
>> (Alternatively, the ifp driver could be updated to support the
>> maximum number of targets on a loop, which might also solve your
>> second problem.)
>
> NB: LIPs are a normal part of everyday life for fibre channel; they
> are not an error.

Right. I don't think it's the LIPs that are the problem but rather (a
guess, not verified) the fact that the HBA loses "light" on its fiber
interface when the switch reboots... I think I also saw the same
ZFS-induced panic when I (stupid, I know, but...) moved a fiber cable
from one GBIC in the switch to another "on the run". I also saw this
with the 'ifp' driver, btw. And as someone wrote, the ifp driver will
never be updated, since it's for EOL'd hardware :-)
Hmm... I just noticed this qla2100.conf option:

  # During link down conditions enable/disable the reporting of
  # errors.
  # 0 = disabled, 1 = enable
  hba0-link-down-error=1;
  hba1-link-down-error=1;

I _wonder_ what might possibly happen if I change that 1 to a 0
(zero)... :-)
On 12/5/06, Peter Eriksson <peter@ifm.liu.se> wrote:
> Hmm... I just noticed this qla2100.conf option:
>
>   # During link down conditions enable/disable the reporting of
>   # errors.
>   # 0 = disabled, 1 = enable
>   hba0-link-down-error=1;
>   hba1-link-down-error=1;

This is the driver that we are using in this configuration. Excellent
insight; this will be added to the testing. Although, we are moving
away from these Qlogic cards, back to the Sun-branded Qlogic cards, to
use MPxIO, which works flawlessly with UFS and raw drives. I wonder if
using MPxIO, with dual HBAs on dual paths, would also reduce these
errors.
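For reference, a minimal sketch of turning on MPxIO for Sun-branded
(qlc/fp, i.e. Leadville) HBAs on Solaris 10; device names shown by the
commands will of course vary per system:

  # Enable MPxIO on all fp ports; rewrites /etc/vfstab device names
  # and requires a reboot
  stmsboot -e

  # After the reboot, list the mapping from per-path device names to
  # the multipathed scsi_vhci names
  stmsboot -L

The manual equivalent is setting mpxio-disable="no" in
/kernel/drv/fp.conf and rebooting.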
Hello Richard,

Tuesday, December 5, 2006, 7:01:17 AM, you wrote:

RE> Dale Ghent wrote:
>>> Similar to UFS's onerror mount option, I take it?

RE> Actually, it would be interesting to see how many customers change
RE> the onerror setting. We have some data, just need more days in the
RE> hour.

Sometimes we do.

--
Best regards,
 Robert                          mailto:rmilkowski@task.gda.pl
                                 http://milek.blogspot.com
Robert Milkowski wrote:
> Hello Richard,
>
> Tuesday, December 5, 2006, 7:01:17 AM, you wrote:
>
> RE> Dale Ghent wrote:
>>>> Similar to UFS's onerror mount option, I take it?
>
> RE> Actually, it would be interesting to see how many customers
> RE> change the onerror setting. We have some data, just need more
> RE> days in the hour.
>
> Sometimes we do.

A preliminary look at a sample of the data shows that 1.6% do change
this to something other than the default (panic). Though this is a
statistically significant sample, it is skewed towards the high-end
systems. A more detailed study would look at the instances where we
had a problem and the system did not panic.
 -- richard
> UFS will panic on EIO also. Most other file systems, too.

In which cases will UFS panic on an I/O error? A quick browse through
the UFS code shows several cases where we can panic if we have bad
metadata on disk, but none where a disk read (or write) fails
altogether.

If UFS fails to read a block, it returns EIO (in most cases;
occasionally a different error, depending on the context) to its
caller. (In a few cases it can continue past the error; for instance,
if it can't read a cylinder group header and wants to allocate a block
there, it will go on to a different cylinder group.) If UFS fails to
write a block, the buffer cache or page cache will just keep retrying.

QFS won't even panic on bad metadata, unless that is enabled with an
/etc/system variable; it will just return errors to its caller. (It
won't panic on I/O errors at all.)

---

As for why expectations with ZFS are higher? I suspect that it's
primarily because ZFS has been sold (deservedly) as being very good at
dealing with hardware problems. This means that it should not only
detect the problems, but continue on past them whenever possible.
Ditto blocks are a first step in this direction. Bringing down the
machine when a read or write fails is so 1980s; ZFS needs a bit of
fine-tuning here.

We don't need to be defensive. ZFS is a new file system. It will take
some time to work all the quirks out, and it will take some time to
eliminate all the panic cases. But we will.
At a minimum, use the QLA2200 HBAs, as they were only recently EOL'd.
If you tried to give me a QLA2100-series HBA, I would not accept it.
It's 5 generations behind the current FC hardware. At least with a
QLA2200 HBA you will get qlc support and MPxIO.

Lyle