Hi all,

yesterday we had a drive failure on a FC-AL JBOD with 14 drives.
Suddenly the zpool using that JBOD stopped responding to I/O requests and we got tons of the following messages in /var/adm/messages:

Sep 3 15:20:10 fb2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g20000004cfd81b9f (sd52):
Sep 3 15:20:10 fb2 SCSI transport failed: reason 'timeout': giving up

"cfgadm -al" or "devfsadm -C" didn't solve the problem.
After a reboot ZFS recognized the drive as failed and all worked well.

Do we need to restart Solaris after a drive failure??

Gino
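(For reference, the sort of sequence one would normally try before falling back to a reboot -- whether it actually unsticks a wedged loop depends on the failure mode, and the pool and device names below are placeholders only, not the ones from this system:)

  # check which pool/device ZFS currently considers unhealthy
  zpool status -x

  # take the suspect device offline so the pool stops retrying it
  zpool offline tank c4t20000004CFD81B9Fd0

  # after the disk has been physically swapped, bring the replacement in
  zpool replace tank c4t20000004CFD81B9Fd0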
I'm going to go out on a limb here and say you have an A5000 with the 1.6" disks in it. Because of their design (all drives seeing each other on both the A and B loops), it's possible for one disk that is behaving badly to take over the FC-AL loop and require human intervention.

You can physically go up to the A5000 and remove the faulty drive if your volume manager software (SVM, VxVM, ZFS, etc.) can still run without the drive. In the above case the WWN (ending in 81b9f) is printed on the label, so it's easy to locate the faulty drive. Keep in mind that sometimes the /next/ functioning drive in the loop can be the one doing the reporting. It's just a quirk of that storage unit.

These days devices will usually have an individual internal FC-AL loop to each drive to alleviate this sort of problem.

Cheers,
Mark.
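(On the A5x00-class enclosures, luxadm can help match the WWN from /var/adm/messages to a physical slot and walk through the removal. Roughly something like the following -- the enclosure name and slot here are examples only:)

  # list the FC enclosures the host can see
  luxadm probe

  # show drive status and WWNs for a given enclosure
  luxadm display BOX0

  # guided removal of the drive in front slot 3 of that enclosure
  luxadm remove_device BOX0,f3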
Hi Mark,

the drive (147GB, FC 2Gb) failed in a Xyratex JBOD. In the past we also had the same problem with a drive that failed in an EMC CX JBOD.
Anyway, I can't understand why rebooting Solaris resolved the situation ..

Thank you,
Gino
On 9/4/07, Gino <ginoruopolo at hotmail.com> wrote:
> yesterday we had a drive failure on a FC-AL JBOD with 14 drives.
> Suddenly the zpool using that JBOD stopped responding to I/O requests
> and we got tons of the following messages in /var/adm/messages:
<snip>
> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
> After a reboot ZFS recognized the drive as failed and all worked well.
>
> Do we need to restart Solaris after a drive failure??

I would hope not but ... prior to putting some ZFS volumes into production we did some failure testing. The hardware I was testing with was a couple of SF-V245s with 4 x 72 GB disks each. Two disks were set up with SVM/UFS as a mirrored OS, the other two were handed to ZFS as a mirrored zpool. I did some large file copies to generate I/O. While a large copy was going on (lots of disk I/O) I pulled one of the drives.

If the I/O was to the zpool, the system would hang (just like it was hung waiting on an I/O operation). I let it sit this way for over an hour with no recovery. After rebooting it found the existing half of the ZFS mirror just fine. Just to be clear, once I pulled the disk, over about a 5 minute period *all* activity on the box hung. Even a shell just running prstat.

If the I/O was to one of the SVM/UFS disks there would be a 60-90 second pause in all activity (just like the ZFS case), but then operation would resume. This is what I am used to seeing for a disk failure.

In the ZFS case I could replace the disk and the zpool would resilver automatically. I could also take the removed disk and put it into the second system and have it recognize the zpool (and that it was missing half of a mirror) and the data was all there.

In no case did I see any data loss or corruption. I had attributed the system hanging to an interaction between the SAS and ZFS layers, but the previous post makes me question that assumption.

As another data point, I have an old Intel box at home I am running x86 on with ZFS. I have a pair of 120 GB PATA disks. OS is on SVM/UFS mirrored partitions and /export/home is on a pair of partitions in a zpool (mirror). I had a bad power connector and sometime after booting lost one of the drives. The server kept running fine. Once I got the drive powered back up (while the server was shut down), the SVM mirrors resync'd and the zpool resilvered. The zpool finished substantially before the SVM.

In all cases the OS was Solaris 10 U3 (11/06) with no additional patches.

--
Paul Kraus
Paul Kraus wrote:
> On 9/4/07, Gino <ginoruopolo at hotmail.com> wrote:
>
>> yesterday we had a drive failure on a FC-AL JBOD with 14 drives.
>> Suddenly the zpool using that JBOD stopped responding to I/O requests
>> and we got tons of the following messages in /var/adm/messages:
>
> <snip>
>
>> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
>> After a reboot ZFS recognized the drive as failed and all worked well.
>>
>> Do we need to restart Solaris after a drive failure??

It depends...

> I would hope not but ... prior to putting some ZFS volumes
> into production we did some failure testing. The hardware I was
> testing with was a couple of SF-V245s with 4 x 72 GB disks each. Two disks
> were set up with SVM/UFS as a mirrored OS, the other two were handed to
> ZFS as a mirrored zpool. I did some large file copies to generate I/O.
> While a large copy was going on (lots of disk I/O) I pulled one of the
> drives.

... on which version of Solaris you are running. ZFS FMA phase 2 was integrated into SXCE build 68. Prior to that release, ZFS had a limited view of the (many) disk failure modes -- it would say a disk was failed only if it could not be opened. In phase 2, the ZFS diagnosis engine was enhanced to look for per-vdev soft error rate discriminator (SERD) engines. More details can be found in the ARC case materials:
http://www.opensolaris.org/os/community/arc/caselog/2007/283/materials/portfolio-txt/

In SXCE build 72 we gain a new FMA I/O retire agent. This is more general purpose and allows a process to set a contract against a device in use.
http://www.opensolaris.org/os/community/on/flag-days/pages/2007080901/
http://www.opensolaris.org/os/community/arc/caselog/2007/290/

> If the I/O was to the zpool, the system would hang (just like
> it was hung waiting on an I/O operation). I let it sit this way for
> over an hour with no recovery. After rebooting it found the existing
> half of the ZFS mirror just fine. Just to be clear, once I pulled the
> disk, over about a 5 minute period *all* activity on the box hung.
> Even a shell just running prstat.

It may depend on what shell you are using. Some shells, such as ksh, write to the $HISTFILE before exec'ing the command. If your $HISTFILE was located in an affected file system, then you would appear hung.

> If the I/O was to one of the SVM/UFS disks there would be a
> 60-90 second pause in all activity (just like the ZFS case), but then
> operation would resume. This is what I am used to seeing for a disk
> failure.

Default retries to most disks are 60 seconds (last time I checked). There are several layers involved here, so you can expect something to happen on 60 second intervals, even if it is just another retry.

> In the ZFS case I could replace the disk and the zpool would
> resilver automatically. I could also take the removed disk and put it
> into the second system and have it recognize the zpool (and that it
> was missing half of a mirror) and the data was all there.
>
> In no case did I see any data loss or corruption. I had
> attributed the system hanging to an interaction between the SAS and
> ZFS layers, but the previous post makes me question that assumption.
>
> As another data point, I have an old Intel box at home I am
> running x86 on with ZFS. I have a pair of 120 GB PATA disks. OS is on
> SVM/UFS mirrored partitions and /export/home is on a pair of
> partitions in a zpool (mirror). I had a bad power connector and
> sometime after booting lost one of the drives. The server kept running
> fine.
> Once I got the drive powered back up (while the server was shut
> down), the SVM mirrors resync'd and the zpool resilvered. The zpool
> finished substantially before the SVM.
>
> In all cases the OS was Solaris 10 U3 (11/06) with no
> additional patches.

The behaviour you describe is what I would expect for that release of Solaris + ZFS.
 -- richard
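(As an aside, on releases with the newer diagnosis engines it is worth checking what FMA has actually concluded before reaching for a reboot; something along these lines, with output obviously differing per system:)

  # list resources FMA currently considers faulty
  fmadm faulty

  # show the fault/diagnosis log
  fmdump

  # dump the underlying error-report telemetry (verbose)
  fmdump -eV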
> >> "cfgadm -al" or "devfsadm -C" didn''t solve the > problem. > >> After a reboot ZFS recognized the drive as failed > and all worked well. > >> > >> Do we need to restart Solaris after a drive > failure?? > > It depends... > ... on which version of Solaris you are running. ZFS > FMA phase 2 was > integrated into SXCE build 68. Prior to that > release, ZFS had a limited > view of the (many) disk failure modes -- it would say > a disk was failed > if it could not be opened. In phase 2, the ZFS > diagnosis engine was > enhanced to look for per-vdev soft error rate > discriminator (SERD) engines.Richard, thank you for your detailed reply. Unfortunately an other reason to stay with UFS in production .. Gino This message posted from opensolaris.org
Gino wrote:
>>> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
>>> After a reboot ZFS recognized the drive as failed and all worked well.
>>>
>>> Do we need to restart Solaris after a drive failure??
>>
>> It depends...
>> ... on which version of Solaris you are running. ZFS FMA phase 2 was
>> integrated into SXCE build 68. Prior to that release, ZFS had a limited
>> view of the (many) disk failure modes -- it would say a disk was failed
>> only if it could not be opened. In phase 2, the ZFS diagnosis engine was
>> enhanced to look for per-vdev soft error rate discriminator (SERD) engines.
>
> Richard, thank you for your detailed reply.
> Unfortunately, another reason to stay with UFS in production ..

IMHO, maturity is the primary reason to stick with UFS. To look at this through the maturity lens, UFS is the great grandfather living on life support (prune juice and oxygen) while ZFS is the late adolescent, soon to bloom into a young adult. The torch will pass when ZFS becomes the preferred root file system.
 -- richard
>> Richard, thank you for your detailed reply.
>> Unfortunately, another reason to stay with UFS in production ..
>
> IMHO, maturity is the primary reason to stick with UFS. To look at
> this through the maturity lens, UFS is the great grandfather living on
> life support (prune juice and oxygen) while ZFS is the late adolescent,
> soon to bloom into a young adult. The torch will pass when ZFS
> becomes the preferred root file system.
> -- richard

I agree with you, but I don't understand why Sun has integrated ZFS into Solaris and declared it stable.
Sun Sales tell you to trash your old redundant arrays and go with JBODs and ZFS...
but they don't tell you that you will probably need to reboot your SF25k because of a disk failure!! :(

Gino
Gino wrote:
>>> Richard, thank you for your detailed reply.
>>> Unfortunately, another reason to stay with UFS in production ..
>>
>> IMHO, maturity is the primary reason to stick with UFS. To look at
>> this through the maturity lens, UFS is the great grandfather living on
>> life support (prune juice and oxygen) while ZFS is the late adolescent,
>> soon to bloom into a young adult. The torch will pass when ZFS
>> becomes the preferred root file system.
>> -- richard
>
> I agree with you, but I don't understand why Sun has integrated ZFS into Solaris and declared it stable.
> Sun Sales tell you to trash your old redundant arrays and go with JBODs and ZFS...
> but they don't tell you that you will probably need to reboot your SF25k because of a disk failure!! :(

To put this in perspective, no system on the planet today handles all faults. I would even argue that building such a system is theoretically impossible. So the subset of faults which ZFS covers is different from the subset that UFS covers, and different from what SVM covers. For example, we *know* that ZFS has allowed people to detect and recover from faulty SAN switches, broken RAID arrays, and accidental deletions which UFS could never even have detected. There are some known gaps which are being closed in ZFS, but it is simply not the case that UFS is superior to ZFS in all RAS respects.
 -- richard
> To put this in perspective, no system on the planet today handles all faults.
> I would even argue that building such a system is theoretically impossible.

no doubt about that ;)

> So the subset of faults which ZFS covers is different from the subset
> that UFS covers, and different from what SVM covers. For example, we *know*
> that ZFS has allowed people to detect and recover from faulty SAN switches,
> broken RAID arrays, and accidental deletions which UFS could never even have
> detected. There are some known gaps which are being closed in ZFS, but it is
> simply not the case that UFS is superior to ZFS in all RAS respects.

I agree ZFS features are outstanding, BUT from my point of view ZFS has been integrated into Solaris too early, without enough testing.

Just a few examples:

-We lost several zpools with S10U3 because of the "spacemap" bug, and -nothing- was recoverable. No fsck here :(

-We had tons of kernel panics because of ZFS. Here a "reboot" must be planned a couple of weeks in advance and done only on Saturday night ..

-Our 9900V and HP EVAs work really BAD with ZFS because of the large cache.
(echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the problem. It only helped a bit.

-ZFS performs badly with a lot of small files.
(about 20 times slower than UFS with our rsync procedures over millions of files)

-ZFS+FC JBOD: a failed hard disk needs a reboot :(((((((((
(frankly unbelievable in 2007!)

Anyway, we happily use ZFS on our new backup systems (snapshotting with ZFS is amazing), but to tell you the truth we are keeping 2 large zpools in sync on each system because we fear another zpool corruption.

Many friends of mine working on big Solaris environments moved to ZFS with S10U3 and then soon went back to UFS because of the same problems.

Sure, for our home servers with cheap ATA drives ZFS is unbeatable and free :)

Gino
On Tue, 2007-09-11 at 13:43 -0700, Gino wrote:
> -ZFS+FC JBOD: a failed hard disk needs a reboot :(((((((((
> (frankly unbelievable in 2007!)

So, I've been using ZFS with some creaky old FC JBODs (A5200's) and old disks which have been failing regularly and haven't seen that; the worst I've seen running nevada was that processes touching the pool got stuck, but they all came unstuck when I powered off the at-fault FC disk via the A5200 front panel.
Bill Sommerfeld wrote:
> On Tue, 2007-09-11 at 13:43 -0700, Gino wrote:
>> -ZFS+FC JBOD: a failed hard disk needs a reboot :(((((((((
>> (frankly unbelievable in 2007!)
>
> So, I've been using ZFS with some creaky old FC JBODs (A5200's) and old
> disks which have been failing regularly and haven't seen that; the worst
> I've seen running nevada was that processes touching the pool got stuck,
> but they all came unstuck when I powered off the at-fault FC disk via
> the A5200 front panel.

Yes, this is a case where the disk has not completely failed. ZFS seems to handle the completely failed disk case properly, and has for a long time. Cutting the power (which you can also do with luxadm) makes the disk appear completely failed.
 -- richard
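(The luxadm power cut mentioned above would look roughly like this for an A5x00-class enclosure -- enclosure name and slot are examples only, and the exact syntax may vary by release and enclosure type:)

  # cut power to the drive in front slot 3 of the enclosure,
  # making it look completely failed to the host
  luxadm power_off BOX0,f3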
On 9/11/07, Gino <ginoruopolo at hotmail.com> wrote:
> -ZFS performs badly with a lot of small files.
> (about 20 times slower than UFS with our rsync procedures over millions of files)

We have seen just the opposite... we have a server with about 40 million files and only 4 TB of data. We have been benchmarking FSes for creation and manipulation of large populations of small files and ZFS is the only one we have found that continues to scale linearly above one million files in one FS. UFS, VxFS, HFS+ (don't ask why), NSS (on NW, not Linux) all show exponential growth in response time as you cross a certain knee (we are graphing time to create <n> zero length files, then do a series of basic manipulations on them) in number of files. For all the FSes we have tested, that knee has been under one million files, except for ZFS. I know this is not 'real world' but it does reflect the response time issues we have been trying to solve. I will see if my client (I am a consultant) will allow me to post the results, as I am under NDA for most of the details of what we are doing.

On the other hand, we have seen serious issues using rsync to migrate this data from the existing server to the Solaris 10 / ZFS system, so perhaps your performance issues were rsync related and not ZFS. In fact, so far the fastest and most reliable method for moving the data is proving to be Veritas NetBackup (back it up on the source server, restore to the new ZFS server).

Now having said all that, we are probably never going to see 100 million files in one zpool, because the ZFS architecture lets us use a more distributed model (many zpools and datasets within them) and still present the end users with a single view of all the data.

--
Paul Kraus
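(Not the actual benchmark described above, which is under NDA, but the rough shape of such a small-file test is easy to reproduce; the directory /tank/smalltest is only a placeholder:)

  # time the creation of 100,000 zero-length files in a test directory
  cd /tank/smalltest
  ptime ksh -c 'i=0; while [ $i -lt 100000 ]; do touch f$i; i=$((i+1)); done'

  # then time a simple traversal of the resulting population
  ptime ksh -c 'find . -type f | wc -l'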
Gino wrote:

[...]

> Just a few examples:
> -We lost several zpools with S10U3 because of the "spacemap" bug,
> and -nothing- was recoverable. No fsck here :(

Yes, I criticized the lack of zpool recovery mechanisms, too, during my AVS testing. But I don't have the know-how to judge whether there are technical reasons for it.

> -We had tons of kernel panics because of ZFS.
> Here a "reboot" must be planned a couple of weeks in advance
> and done only on Saturday night ..

Well, I'm sorry, but if your datacenter runs into problems when a single server isn't available, you probably have much worse problems. ZFS is a file system. It's not a substitute for hardware trouble or a misplanned infrastructure. What would you do if you had the fsck you mentioned earlier? Or with another file system like UFS, ext3, whatever? Boot a system into single user mode and fsck several terabytes, after planning it a couple of weeks in advance?

> -Our 9900V and HP EVAs work really BAD with ZFS because of the large cache.
> (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the problem. It only helped a bit.

Use JBODs. Or tell the cache controllers to ignore the flushing requests. Should be possible, even the $10k low-cost StorageTek arrays support this.

> -ZFS performs badly with a lot of small files.
> (about 20 times slower than UFS with our rsync procedures over millions of files)

I have large Sybase database servers and file servers with billions of inodes running ZFSv3. They are attached to X4600 boxes running Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap Infortrend FC JBODs (2 GBit/s) as storage shelves. All my benchmarks (both on the command line and within applications) show that the FibreChannel is the bottleneck, even with random reads. ZFS doesn't do this out of the box, but a bit of tuning helped a lot.

> -ZFS+FC JBOD: a failed hard disk needs a reboot :(((((((((
> (frankly unbelievable in 2007!)

No. Read the thread carefully. It was mentioned that you don't have to reboot the server; all you need to do is pull the hard disk. Shouldn't be a problem, except if you don't want to replace the faulty one anyway. No other manual operations will be necessary, except for the final "zpool replace". You could also try cfgadm to get rid of ZFS pool problems; perhaps it works - I'm not sure about this, because I had the idea *after* I solved that problem, but I'll give it a try someday.

> Anyway, we happily use ZFS on our new backup systems (snapshotting
> with ZFS is amazing), but to tell you the truth we are keeping 2 large
> zpools in sync on each system because we fear another zpool corruption.

May I ask how you accomplish that? And why are you doing this? You should replicate your zpool to another host, instead of mirroring locally. Where's your redundancy in that?

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
> On Tue, 2007-09-11 at 13:43 -0700, Gino wrote:
>> -ZFS+FC JBOD: a failed hard disk needs a reboot :(((((((((
>> (frankly unbelievable in 2007!)
>
> So, I've been using ZFS with some creaky old FC JBODs (A5200's) and old
> disks which have been failing regularly and haven't seen that; the worst
> I've seen running nevada was that processes touching the pool got stuck,

this is the problem

> but they all came unstuck when I powered off the at-fault FC disk via
> the A5200 front panel.

I'll try again with the EMC JBOD, but anyway the fact remains that you need to recover manually from a hard disk failure.

gino
> Yes, this is a case where the disk has not completely failed.
> ZFS seems to handle the completely failed disk case properly, and
> has for a long time. Cutting the power (which you can also do with
> luxadm) makes the disk appear completely failed.

Richard, I think you're right. The failed disk is still working, but it has no spare space left for bad sectors...

Gino
> We have seen just the opposite... we have a server with about
> 40 million files and only 4 TB of data. We have been benchmarking FSes
> for creation and manipulation of large populations of small files and
> ZFS is the only one we have found that continues to scale linearly
> above one million files in one FS. UFS, VxFS, HFS+ (don't ask why),
> NSS (on NW, not Linux) all show exponential growth in response time as
> you cross a certain knee (we are graphing time to create <n> zero
> length files, then do a series of basic manipulations on them) in
> number of files. For all the FSes we have tested, that knee has been
> under one million files, except for ZFS. I know this is not 'real
> world' but it does reflect the response time issues we have been
> trying to solve. I will see if my client (I am a consultant) will
> allow me to post the results, as I am under NDA for most of the
> details of what we are doing.

It would be great!

> On the other hand, we have seen serious issues using rsync to
> migrate this data from the existing server to the Solaris 10 / ZFS
> system, so perhaps your performance issues were rsync related and not
> ZFS. In fact, so far the fastest and most reliable method for moving
> the data is proving to be Veritas NetBackup (back it up on the source
> server, restore to the new ZFS server).
>
> Now having said all that, we are probably never going to see
> 100 million files in one zpool, because the ZFS architecture lets us
> use a more distributed model (many zpools and datasets within them)
> and still present the end users with a single view of all the data.

Hi Paul,
may I ask what your average file size is? Have you done any optimization? ZFS recordsize?
Did your test also include writing 1 million files?

Gino
>> -We had tons of kernel panics because of ZFS.
>> Here a "reboot" must be planned a couple of weeks in advance
>> and done only on Saturday night ..
>
> Well, I'm sorry, but if your datacenter runs into problems when a single
> server isn't available, you probably have much worse problems. ZFS is a
> file system. It's not a substitute for hardware trouble or a misplanned
> infrastructure. What would you do if you had the fsck you mentioned
> earlier? Or with another file system like UFS, ext3, whatever? Boot a
> system into single user mode and fsck several terabytes, after planning
> it a couple of weeks in advance?

For example, we have a couple of apps using 80-290GB RAM. Some thousands of users. We use Solaris+SPARC+high-end storage because we can't afford downtime. We can deal with a failed file system. A reboot during the day would cost a lot of money.
The real problem is that ZFS should stop forcing kernel panics.

>> -Our 9900V and HP EVAs work really BAD with ZFS because of the large cache.
>> (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the problem. It only helped a bit.
>
> Use JBODs. Or tell the cache controllers to ignore the flushing
> requests. Should be possible, even the $10k low-cost StorageTek arrays
> support this.

Unfortunately the HP EVA can't do it. As for the 9900V, it is really fast (64GB cache helps a lot) and reliable. 100% uptime in years. We'll never touch it to solve a ZFS problem.
We started using JBODs (12 x 16-drive shelves) with ZFS, but speed and reliability (today) are not comparable to HDS+UFS.

>> -ZFS performs badly with a lot of small files.
>> (about 20 times slower than UFS with our rsync procedures over millions of files)
>
> I have large Sybase database servers and file servers with billions of
> inodes running ZFSv3. They are attached to X4600 boxes running
> Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap
> Infortrend FC JBODs (2 GBit/s) as storage shelves.

Are you using FATA drives?

> All my benchmarks (both on the command line and within applications)
> show that the FibreChannel is the bottleneck, even with random reads.
> ZFS doesn't do this out of the box, but a bit of tuning helped a lot.

You found another good point. I think that with ZFS and JBODs, FC links will soon be the bottleneck.
What tuning have you done?

>> -ZFS+FC JBOD: a failed hard disk needs a reboot :(((((((((
>> (frankly unbelievable in 2007!)
>
> No. Read the thread carefully. It was mentioned that you don't have to
> reboot the server; all you need to do is pull the hard disk. Shouldn't
> be a problem, except if you don't want to replace the faulty one anyway.

It is a problem if your apps hang waiting for you to power down/pull out the drive! Especially in a time=money environment :)

> No other manual operations will be necessary, except for the final
> "zpool replace". You could also try cfgadm to get rid of ZFS pool
> problems; perhaps it works - I'm not sure about this, because I had the
> idea *after* I solved that problem, but I'll give it a try someday.
>
>> Anyway, we happily use ZFS on our new backup systems (snapshotting
>> with ZFS is amazing), but to tell you the truth we are keeping 2 large
>> zpools in sync on each system because we fear another zpool corruption.
>
> May I ask how you accomplish that?

During the day we sync pool1 with pool2, then we "umount pool2" during scheduled backup operations at night.

> And why are you doing this?
> You should replicate your zpool to another host, instead of mirroring
> locally. Where's your redundancy in that?

We have 4 backup hosts. Soon we'll move to a 10G network and we'll replicate to different hosts, as you pointed out.

Gino
Gino wrote:
> The real problem is that ZFS should stop forcing kernel panics.

I found these panics very annoying, too. And even more so that the zpool was faulted afterwards. But my problem is that when someone asks me what ZFS should do instead, I have no idea.

>> I have large Sybase database servers and file servers with billions of
>> inodes running ZFSv3. They are attached to X4600 boxes running
>> Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap
>> Infortrend FC JBODs (2 GBit/s) as storage shelves.
>
> Are you using FATA drives?

Seagate FibreChannel drives, Cheetah 15k, ST3146855FC for the databases. For the NFS filers we use Infortrend FC shelves with SATA inside.

>> All my benchmarks (both on the command line and within applications)
>> show that the FibreChannel is the bottleneck, even with random reads.
>> ZFS doesn't do this out of the box, but a bit of tuning helped a lot.
>
> You found another good point.
> I think that with ZFS and JBODs, FC links will soon be the bottleneck.
> What tuning have you done?

That depends on the individual requirements of each service. Basically, we change the recordsize according to the transaction size of the databases and, on the filers, the performance results were best when the recordsize was a bit lower than the average file size (the average file size is 12K, so I set a recordsize of 8K). I set a vdev cache size of 8K and our databases worked best with a vq_max_pending of 32. ZFSv3 was used, that's the version which is shipped with Solaris 10 11/06.

> It is a problem if your apps hang waiting for you to power down/pull out
> the drive! Especially in a time=money environment :)

Yes, but why doesn't your application fail over to a standby? I'm also working in a "time is money and failure no option" environment, and I doubt I would sleep better if I were responsible for an application under such a service level agreement without full high availability. If a system reboot can be a single point of failure, what about the network infrastructure? Hardware errors? Or power outages?

I'm definitely NOT some kind of know-it-all, don't misunderstand me. Your statement just let my alarm bells ring, and that's why I'm asking.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
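(For anyone wanting to try the same knobs, the tuning described above looks roughly like this; the dataset name is a placeholder, and zfs_vdev_max_pending is the commonly cited tunable behind vq_max_pending in that Solaris release, so treat the exact name and value as an assumption to verify:)

  # per-dataset recordsize, matched to the application's transaction size
  zfs set recordsize=8k tank/db

  # in /etc/system: pending I/O depth per vdev (takes effect after reboot)
  set zfs:zfs_vdev_max_pending = 32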
> Gino wrote:
>> The real problem is that ZFS should stop forcing kernel panics.
>
> I found these panics very annoying, too. And even more so that the zpool
> was faulted afterwards. But my problem is that when someone asks me what
> ZFS should do instead, I have no idea.

well, what about just hanging the processes waiting for I/O on that zpool? Could that be possible?

> Seagate FibreChannel drives, Cheetah 15k, ST3146855FC for the databases.

What kind of JBOD for those drives? Just to know ... We found Xyratex's to be good products.

> That depends on the individual requirements of each service. Basically,
> we change the recordsize according to the transaction size of the
> databases and, on the filers, the performance results were best when the
> recordsize was a bit lower than the average file size (the average file
> size is 12K, so I set a recordsize of 8K). I set a vdev cache size of 8K
> and our databases worked best with a vq_max_pending of 32. ZFSv3 was
> used, that's the version which is shipped with Solaris 10 11/06.

thanks for sharing.

> Yes, but why doesn't your application fail over to a standby?

It is a little complex to explain. Basically those apps are doing a lot of "number crunching" on some very big data sets in RAM. Failover would mean starting again from the beginning, with all the customers waiting for hours (and losing money).
We are working on a new app, capable of working with a couple of nodes, but it will take some months to be in beta, then 2 years of testing ...

> If a system reboot can be a single point of failure, what about the
> network infrastructure? Hardware errors? Or power outages?

We use Sunfire for that reason. We had 2 CPU failures and no service interruption, the same for 1 DIMM module (we have been lucky with CPU failures ;)).
HDS raid arrays are excellent for availability. Lots of FC links, network links ..
All this is in a fully redundant datacenter .. and, sure, we have a standby system at a disaster recovery site (hope to never use it!).

> I'm definitely NOT some kind of know-it-all, don't misunderstand me.
> Your statement just let my alarm bells ring, and that's why I'm asking.

Don't worry Ralf. Any suggestion/opinion/criticism is welcome.
It's a pleasure to exchange our experiences.

Gino
Wade.Stuart at fallon.com - 2007-Sep-12 14:15 UTC - [zfs-discuss] I/O freeze after a disk failure
zfs-discuss-bounces at opensolaris.org wrote on 09/12/2007 08:04:33 AM:

>> I found these panics very annoying, too. And even more so that the zpool
>> was faulted afterwards. But my problem is that when someone asks me what
>> ZFS should do instead, I have no idea.
>
> well, what about just hanging the processes waiting for I/O on that zpool?
> Could that be possible?

It seems that maybe there is too large a code path leading to panics -- maybe a side effect of ZFS being "new" (compared to other filesystems). I would hope that as these panic issues come up, the code path leading to the panic is evaluated for a specific fix or behavior code path. Sometimes it does make sense to panic (if there _will_ be data damage if you continue). Other times not.

> It is a little complex to explain. Basically those apps are doing a lot
> of "number crunching" on some very big data sets in RAM. Failover would
> mean starting again from the beginning, with all the customers waiting
> for hours (and losing money).
> We are working on a new app, capable of working with a couple of nodes,
> but it will take some months to be in beta, then 2 years of testing ...
>
> We use Sunfire for that reason. We had 2 CPU failures and no service
> interruption, the same for 1 DIMM module (we have been lucky with CPU
> failures ;)).
> HDS raid arrays are excellent for availability. Lots of FC links,
> network links ..
> All this is in a fully redundant datacenter .. and, sure, we have a
> standby system at a disaster recovery site (hope to never use it!).

I can understand where you are coming from as far as the need for uptime and loss of money on that app server. Two years of testing for the app, Sunfire servers for N+1 because the app can't be clustered, and you have chosen to run a filesystem that has just been made public? ZFS may be great and all, but this stinks of running a .0 version on the production machine. VxFS+snap has well known and documented behaviors, tested for years on production machines. Why did you even choose to run ZFS on that specific box?

Do not get me wrong, I really like many things about ZFS -- it is ground breaking. I still do not get why it would be chosen for a server in that position until it has better real world production testing and modeling. You have taken all of the buildup you have done and introduced an unknown into the mix.
> It seems that maybe there is too large a code path leading to panics --
> maybe a side effect of ZFS being "new" (compared to other filesystems). I
> would hope that as these panic issues come up, the code path leading to
> the panic is evaluated for a specific fix or behavior code path.
> Sometimes it does make sense to panic (if there _will_ be data damage if
> you continue). Other times not.

I think the same about panics. So, IMHO, ZFS should not be called "stable". But you know ... marketing ... ;)

> I can understand where you are coming from as far as the need for
> uptime and loss of money on that app server. Two years of testing for the
> app, Sunfire servers for N+1 because the app can't be clustered, and you
> have chosen to run a filesystem that has just been made public?

What? That server is running and will be running on UFS for many years! Upgrading, patching, cleaning ... even touching it is strictly prohibited :)
We upgraded to S10 because of DTrace (it helped us a lot) and during the test phase we also evaluated ZFS.
Now we only use ZFS for our central backup servers (for many applications, systems, customers, ...).
We also manage a lot of other systems and always try to migrate customers to Solaris because of stability, resource control, DTrace .. but we find ZFS disappointing today (probably tomorrow it will be THE filesystem).

Gino
>>> Use JBODs. Or tell the cache controllers to ignore
>>> the flushing requests.

ginoruopolo at hotmail.com said:
> Unfortunately the HP EVA can't do it. As for the 9900V, it is really fast
> (64GB cache helps a lot) and reliable. 100% uptime in years. We'll never
> touch it to solve a ZFS problem.

On our low-end HDS array (9520V), turning on "Synchronize Cache Invalid Mode" did the trick for ZFS purposes (Solaris-10U3). They've since added a Solaris kernel tunable in /etc/system:

  set zfs:zfs_nocacheflush = 1

This has the unfortunate side-effect of disabling it on all disks for the whole system, though. ZFS is getting more mature all the time....

Regards,
Marion
> Paul Kraus wrote:
>> In the ZFS case I could replace the disk and the zpool would
>> resilver automatically. I could also take the removed disk and put it
>> into the second system and have it recognize the zpool (and that it
>> was missing half of a mirror) and the data was all there.
>>
>> In no case did I see any data loss or corruption. I had
>> attributed the system hanging to an interaction between the SAS and
>> ZFS layers, but the previous post makes me question that assumption.
>>
>> As another data point, I have an old Intel box at home I am
>> running x86 on with ZFS. I have a pair of 120 GB PATA disks. OS is on
>> SVM/UFS mirrored partitions and /export/home is on a pair of
>> partitions in a zpool (mirror). I had a bad power connector and
>> sometime after booting lost one of the drives. The server kept running
>> fine. Once I got the drive powered back up (while the server was shut
>> down), the SVM mirrors resync'd and the zpool resilvered. The zpool
>> finished substantially before the SVM.
>>
>> In all cases the OS was Solaris 10 U3 (11/06) with no
>> additional patches.
>
> The behaviour you describe is what I would expect for that release of
> Solaris + ZFS.

It seems this is fixed in SXCE; do you know if some of the fixes made it into 10_U4?

Thanks,
Paul