I'm looking at bringing up a new Solaris 10 based file server running on an older UltraSPARC-IIi 360MHz with 512MB of RAM. I've installed the 11/06 release from scratch; no patches are installed at this time. I have four externally attached 36GB SCSI devices on the host's SCSI bus. I set up a few different zpool scenarios with mirrors, raidz, and raidz2 to familiarize myself with the commands, and created some home-directory-like filesystems off the pool.

I'm trying to simulate a drive failure by either powering down a single drive or physically removing it from its enclosure so as not to interrupt the SCSI bus. However, each time I do this and then attempt to access my ZFS pool, the system hangs and I get flooded with errors:

Jan 23 14:49:13 foo scsi: WARNING: /pci@1f,0/pci@1/scsi@1,1/sd@a,0 (sd24):
Jan 23 14:49:13 foo     disk not responding to selection

Eventually the system freezes and I have to go down to the eeprom level and issue a boot command to restart the host.

Is this type of failure something it should be able to handle, or am I doing something wrong and my expectations are too high here? Is this an issue with ZFS, or more with the host system not being able to cope with a device being removed in this fashion?

Also, does anyone have an opinion, based on the system I'm using, on whether it would be sufficient to go into production, assuming the errors I'm having can be addressed? The system would simply be an NFS server for home shares for approximately 100 users.

Thanks,
-Jeff
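[For reference, a minimal sketch of the kind of test setup described above, assuming hypothetical device names c1t8d0 through c1t11d0 for the four external disks; substitute the targets your system actually reports:]

    # create a raidz pool across the four external 36GB disks
    # (device names here are assumptions, not from the original post)
    zpool create tank raidz c1t8d0 c1t9d0 c1t10d0 c1t11d0

    # carve out home-directory-style filesystems
    zfs create tank/home
    zfs create tank/home/jeff

    # check pool health before starting failure tests
    zpool status tank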
Andrea Soliva
2007-Jan-25 10:46 UTC
[zfs-discuss] Re: Some questions I had while testing ZFS.
Hi,

I did exactly the same test and have the same issue. I also posted a message; no answer so far.

Andrea
Anders Odberg
2007-Jan-25 12:39 UTC
[zfs-discuss] Re: Some questions I had while testing ZFS.
[Jeffrey Scott]
| I'm trying to simulate a drive failure by either powering down a single
| drive or physically removing it from its enclosure so as not to
| interrupt the SCSI bus, however each time I do this and then attempt to
| access my zfs pool the system hangs and I get flooded with errors:
| Jan 23 14:49:13 foo scsi: WARNING: /pci@1f,0/pci@1/scsi@1,1/sd@a,0 (sd24):
| Jan 23 14:49:13 foo     disk not responding to selection
|
| eventually the system will freeze and I will have to go down to the
| eeprom level and issue a boot command to restart the host.
|
| Is this type of failure something it should be able to handle or am I
| doing something wrong and my expectations are too high here?
|
| Is this an issue with ZFS or more with the host system not being able
| to cope with a device being removed in this fashion.

I've seen similar problems with a T2000. If I create a mirror of the internal SAS disks with "zpool create foo mirror c0t2d0 c0t3d0" and physically remove one of those disks, the system will hang completely after a short while, and I have to break the system from ALOM and reboot. If I do a "zpool offline" of the disk first, there are no problems when removing the disk.

If I create a DiskSuite mirror, or a HW-RAID mirror, on the disks instead, and then create a single-disk zpool on top of this mirror, there are no problems with the system or ZFS when I physically remove one of the disks in the mirror.

I opened a support case with Sun about this, and after a while I received a test patch (IDR125057-01) which so far seems to have solved all my problems with this issue. If you have a support contract with Sun, you could probably ask for this test patch. I've not been told when it will make it into an official patch.

Regards,
-Anders.

--
Anders Odberg, <anders.odberg@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway
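[A sketch of the offline-before-pull procedure Anders describes, assuming his pool name "foo" and device c0t2d0; the replace/online steps are standard ZFS administration, not quoted from his post:]

    # take the disk out of service before physically pulling it
    zpool offline foo c0t2d0

    # ... physically remove or swap the disk ...

    # if the same disk goes back in, return it to the mirror
    zpool online foo c0t2d0

    # if a new disk was inserted at the same target instead:
    zpool replace foo c0t2d0

    # verify the mirror resilvers and returns to ONLINE
    zpool status foo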
Jeremy Teo
2007-Jan-25 12:58 UTC
[zfs-discuss] Re: Some questions I had while testing ZFS.
This is 6456939: sd_send_scsi_SYNCHRONIZE_CACHE_biodone() can issue TUR which calls biowait() and deadlock/hangs host

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6456939

(Thanks to Tpenta for digging this up)

--
Regards,
Jeremy
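[If you obtain the test patch mentioned earlier in the thread, one way to confirm it landed is to check the installed patch list; this assumes the IDR shows up in showrev output, which can vary for interim releases:]

    # list installed patches and look for the IDR / eventual official fix
    showrev -p | grep 125057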