Hi,

I am looking for some best practice advice on a project that I am working on.

We are looking at migrating ~40TB of backup data to ZFS, with an annual data growth of
20-25%.

Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
with one hot spare per 10 drives and just continue to expand that pool as needed.

Between calculating the MTTDL and performance models I was hit by a rather scary thought.

A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
of a vdev would render the entire pool unusable.

This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
vdev should die before the resilvering of at least one disk is complete. Since most disks
will be filled I do expect rather long resilvering times.

We are using 750 GB Seagate (enterprise grade) SATA disks for this project with as much hardware
redundancy as we can get (multiple controllers, dual cabling, I/O multipathing, redundant PSUs,
etc.)

I could use multiple pools but that would make data management harder, which in itself is a lengthy
process in our shop.

The MTTDL figures seem OK so how much do I need to worry? Does anyone have experience with
this kind of setup?

/Don E.
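A back-of-envelope sketch (Python) of the triple-failure scenario described above. The per-disk MTBF and the resilver window are assumed placeholder values rather than figures from the post; substitute your own numbers.

from math import comb

mtbf_hours   = 1_000_000   # assumed per-disk MTBF (typical enterprise SATA spec-sheet figure)
resilver_hrs = 24          # assumed resilver window for a mostly full 750 GB disk
survivors    = 8           # disks left in a 7+2 RAIDZ-2 vdev after the first failure

# P(a given surviving disk also fails inside the resilver window),
# approximating an exponential failure law for small t/MTBF
p_disk = resilver_hrs / mtbf_hours

# P(at least two of the eight survivors fail before the resilver completes),
# keeping only the dominant two-failure term
p_vdev = comb(survivors, 2) * p_disk ** 2

print(f"P(vdev loss during one resilver) ~ {p_vdev:.1e}")

# With N such vdevs in the pool, the exposure per resilver event scales
# roughly as N * p_vdev -- the "weakest vdev" concern grows with pool width.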
Don Enrique wrote:
> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
> with one hot spare per 10 drives and just continue to expand that pool as needed.
>
> Between calculating the MTTDL and performance models I was hit by a rather scary thought.
>
> A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
> of a vdev would render the entire pool unusable.
>
> This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
> vdev should die before the resilvering of at least one disk is complete. Since most disks
> will be filled I do expect rather long resilvering times.

Why are you planning on using RAIDZ-2 rather than mirroring?

--
Darren J Moffat
> Don Enrique wrote:
> > Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
> > with one hot spare per 10 drives and just continue to expand that pool as needed.
> >
> > Between calculating the MTTDL and performance models I was hit by a rather scary thought.
> >
> > A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
> > of a vdev would render the entire pool unusable.
> >
> > This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
> > vdev should die before the resilvering of at least one disk is complete. Since most disks
> > will be filled I do expect rather long resilvering times.
>
> Why are you planning on using RAIDZ-2 rather than mirroring?

Mirroring would increase the cost significantly and is not within the budget of this project.

> --
> Darren J Moffat
On Thu, 3 Jul 2008, Don Enrique wrote:
>
> This means that I potentially could lose 40TB+ of data if three
> disks within the same RAIDZ-2 vdev should die before the resilvering
> of at least one disk is complete. Since most disks will be filled I
> do expect rather long resilvering times.

Yes, this risk always exists. The probability of three disks
independently dying during the resilver is exceedingly low. The chance
that your facility will be hit by an airplane during the resilver is
likely higher. However, it is true that RAIDZ-2 does not offer the
same ease of control over physical redundancy that mirroring does.
If you were to use 10 independent chassis and split each RAIDZ-2
uniformly across the chassis, then the probability of a similar
calamity impacting the same drives would be driven by rack- or
facility-wide factors (e.g. the building burning down) rather than
shelf factors. However, if you had 10 RAID arrays mounted in the same
rack and the rack falls over on its side during a resilver, then hope
is still lost.

I am not seeing any other options for you here. ZFS RAIDZ-2 is about as
good as it gets, and if you want everything in one huge pool, there
will be more risk. Perhaps there is a virtual filesystem layer which
could be used on top of ZFS to emulate a larger filesystem while
refusing to split files across pools.

In the future it would be useful for ZFS to provide the option to not
load-share across huge VDEVs and use VDEV-level space allocators.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Don Enrique wrote:
> Hi,
>
> I am looking for some best practice advice on a project that I am working on.
>
> We are looking at migrating ~40TB of backup data to ZFS, with an annual data growth of
> 20-25%.
>
> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
> with one hot spare per 10 drives and just continue to expand that pool as needed.
>
> Between calculating the MTTDL and performance models I was hit by a rather scary thought.
>
> A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
> of a vdev would render the entire pool unusable.

Yes, but a raidz2 vdev using enterprise-class disks is very reliable.

> This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
> vdev should die before the resilvering of at least one disk is complete. Since most disks
> will be filled I do expect rather long resilvering times.
>
> We are using 750 GB Seagate (enterprise grade) SATA disks for this project with as much hardware
> redundancy as we can get (multiple controllers, dual cabling, I/O multipathing, redundant PSUs,
> etc.)

nit: SATA disks are single port, so you would need a SAS implementation
to get multipathing to the disks. This will not significantly impact the
overall availability of the data, however. I did an availability analysis
of Thumper to show this.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

> I could use multiple pools but that would make data management harder, which in itself is a lengthy
> process in our shop.
>
> The MTTDL figures seem OK so how much do I need to worry? Does anyone have experience with
> this kind of setup?

I think your design is reasonable. We'd need to know the exact hardware
details to be able to make more specific recommendations.
 -- richard
I'm going down a bit of a different path with my reply here. I know that all
shops and their needs for data are different, but hear me out.

1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's
insane. Perhaps it's time to look at your backup strategy not from a hardware
perspective, but from a data retention perspective. Do you really need that
much data backed up? There has to be some way to get the volume down. If not,
you're at 100TB in slightly over 4 years (assuming the 25% growth factor). If
your data is critical, my recommendation is to go find another job and let
someone else have that headache.

2) 40TB of backups is, at the best possible price, 50 1TB drives (allowing for
spares and such), or $12,500 for raw drive hardware. Enclosures add some money,
as do cables and such. For mirroring, 90 1TB drives is $22,500 for the raw
drives. In my world, I know yours is different, but the difference between a
$100,000 solution and a $75,000 solution is pretty negligible. The short
description here: you can afford to do mirrors. Really, you can. Any of the
parity solutions out there, I don't care what your strategy is, is going to
cause you more trouble than you're ready to deal with.

I know these aren't solutions for you; it's just the stuff that was in my head.
The best possible solution, if you really need this kind of volume, is to
create something that never has to resilver. Use some nifty combination of
hardware and ZFS, like a couple of somethings that each export 20TB per
container as a single volume, and mirror those with ZFS for its end-to-end
checksumming and ease of management.

That's my considerably more than $0.02

On Thu, Jul 3, 2008 at 11:56 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:

> On Thu, 3 Jul 2008, Don Enrique wrote:
> >
> > This means that I potentially could lose 40TB+ of data if three
> > disks within the same RAIDZ-2 vdev should die before the resilvering
> > of at least one disk is complete. Since most disks will be filled I
> > do expect rather long resilvering times.
>
> Yes, this risk always exists. The probability of three disks
> independently dying during the resilver is exceedingly low. The chance
> that your facility will be hit by an airplane during the resilver is
> likely higher. However, it is true that RAIDZ-2 does not offer the
> same ease of control over physical redundancy that mirroring does.
> If you were to use 10 independent chassis and split each RAIDZ-2
> uniformly across the chassis, then the probability of a similar
> calamity impacting the same drives would be driven by rack- or
> facility-wide factors (e.g. the building burning down) rather than
> shelf factors. However, if you had 10 RAID arrays mounted in the same
> rack and the rack falls over on its side during a resilver, then hope
> is still lost.
>
> I am not seeing any other options for you here. ZFS RAIDZ-2 is about as
> good as it gets, and if you want everything in one huge pool, there
> will be more risk. Perhaps there is a virtual filesystem layer which
> could be used on top of ZFS to emulate a larger filesystem while
> refusing to split files across pools.
>
> In the future it would be useful for ZFS to provide the option to not
> load-share across huge VDEVs and use VDEV-level space allocators.
>
> Bob

--
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes
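A rough reconstruction (Python) of the drive-count arithmetic in the message above. The $250-per-drive price follows from the totals quoted there; the 7+2 RAIDZ-2 grouping and the one-spare-per-10-drives ratio are taken from the original post, so the totals come out slightly different from the round numbers quoted.

import math

usable_tb       = 40
drive_tb        = 1
price_per_drive = 250   # implied by 50 drives ~ $12,500

# RAIDZ-2 (7 data + 2 parity per vdev), plus one hot spare per 10 drives
vdevs       = math.ceil(usable_tb / (7 * drive_tb))   # 6 vdevs
raidz_disks = vdevs * 9
raidz_disks += math.ceil(raidz_disks / 10)            # hot spares

# 2-way mirrors, plus the same spare ratio
mirror_disks = 2 * math.ceil(usable_tb / drive_tb)
mirror_disks += math.ceil(mirror_disks / 10)

print(f"raidz2 : {raidz_disks:3d} drives, ~${raidz_disks * price_per_drive:,}")
print(f"mirror : {mirror_disks:3d} drives, ~${mirror_disks * price_per_drive:,}")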
>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes: >>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:djm> Why are you planning on using RAIDZ-2 rather than mirroring ? isn''t MTDL sometimes shorter for mirroring than raidz2? I think that is the biggest point of raidz2, is it not? bf> The probability of three disks independently dying during the bf> resilver The thing I never liked about MTDL models is their assuming disk failures are independent events. It seems likely to get a bad batch of disks if you buy a single model from a single manufacturer, and buy all the disks at the same time. They may have consecutive serial numbers, ship in the same box, u.s.w. You can design around marginal power supplies that feed a bank of disks with excessive ripple voltage, cause them all to write marginally readable data, and later make you think the disks all went bad at once. or use long fibre cables to put chassis in different rooms with separate aircon. or tell yourself other strange disaster stories and design around them. But fixing the lack of diversity in manufacturing and shipping seems hard. For my low-end stuff, I have been buying the two sides of mirrors from two companies, but I don''t know how workable that is for people trying to look ``professional''''. It''s also hard to do with raidz since there are so few hard drive brands left. Retailers ought to charge an extra markup for ``aging'''' the drives for you like cheese, and maintian several color-coded warehouses in which to do the aging: ``sell me 10 drives that were aged for six months in the Green warehouse.'''' -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080703/82117c8b/attachment.bin>
Richard Elling wrote:
> Don Enrique wrote:
>> Hi,
>>
>> I am looking for some best practice advice on a project that I am working on.
>>
>> We are looking at migrating ~40TB of backup data to ZFS, with an annual data growth of
>> 20-25%.
>>
>> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
>> with one hot spare per 10 drives and just continue to expand that pool as needed.
>>
>> Between calculating the MTTDL and performance models I was hit by a rather scary thought.
>>
>> A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
>> of a vdev would render the entire pool unusable.
>
> Yes, but a raidz2 vdev using enterprise-class disks is very reliable.

That's nice to hear.

>> This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
>> vdev should die before the resilvering of at least one disk is complete. Since most disks
>> will be filled I do expect rather long resilvering times.
>>
>> We are using 750 GB Seagate (enterprise grade) SATA disks for this project with as much hardware
>> redundancy as we can get (multiple controllers, dual cabling, I/O multipathing, redundant PSUs,
>> etc.)
>
> nit: SATA disks are single port, so you would need a SAS implementation
> to get multipathing to the disks. This will not significantly impact the
> overall availability of the data, however. I did an availability analysis
> of Thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Yeah, I read your blog. Very informative indeed. I am using SAS HBA cards
and SAS enclosures with SATA disks, so I should be fine.

>> I could use multiple pools but that would make data management harder, which in itself is a lengthy
>> process in our shop.
>>
>> The MTTDL figures seem OK so how much do I need to worry? Does anyone have experience with
>> this kind of setup?
>
> I think your design is reasonable. We'd need to know the exact hardware
> details to be able to make more specific recommendations.
>  -- richard

Well, my choice of hardware is kind of limited by two things:

1. We are a 100% Dell shop.
2. We already have lots of enclosures that I would like to reuse for this project.

The HBA cards are SAS 5/E (LSI SAS1068 chipset) cards, and the enclosures
are Dell MD1000 disk arrays.

--
Med venlig hilsen / Best Regards

Henrik Johansen
henrik at myunix.dk
Miles Nordin wrote:
>>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes:
>>>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>
> djm> Why are you planning on using RAIDZ-2 rather than mirroring ?
>
> Isn't MTTDL sometimes shorter for mirroring than raidz2? I think that
> is the biggest point of raidz2, is it not?

Yes. For some MTTDL models, a 3-way mirror is roughly equivalent to a
3-disk raidz2 set, with the mirror being slightly better because you do
not require both of the other two disks to be functional during
reconstruction. As the number of disks in the set increases, the MTTDL
goes down, so a 4-disk raidz2 will have a lower MTTDL than a 3-disk
mirror. Somewhere I have graphs which show this...
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

> bf> The probability of three disks independently dying during the
> bf> resilver
>
> The thing I never liked about MTTDL models is their assumption that disk
> failures are independent events. It seems likely that you'll get a bad
> batch of disks if you buy a single model from a single manufacturer and
> buy all the disks at the same time. They may have consecutive serial
> numbers, ship in the same box, and so on.

You are correct in that the models assume independent failures. Common
failures for "independent" devices (e.g. vintages) can be modeled using
an adjusted MTBF. For example, we sometimes see a vintage where the MTBF
is statistically significantly different from other vintages. These can
be difficult to predict, and any such predictions may not help you make
decisions. Somewhere I talk about that...
http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent

> You can design around marginal power supplies that feed a bank of disks
> with excessive ripple voltage, cause them all to write marginally
> readable data, and later make you think the disks all went bad at once.
> Or use long fibre cables to put chassis in different rooms with separate
> aircon. Or tell yourself other strange disaster stories and design
> around them. But fixing the lack of diversity in manufacturing and
> shipping seems hard.

My favorite is the guy who zip-ties the fiber in a tight wad at the back
of the rack. Fiber (and copper) cables have a minimum bend radius
specification. In fiber cables, small cracks can occur which, over time,
become larger and cause attenuation. If you are really interested in
diversity, you need to copy the data someplace far, far away, as many of
the Katrina survivors learned. But even that might not be enough
diversity...
http://blogs.sun.com/relling/entry/diversity_revisited
http://blogs.sun.com/relling/entry/diversity_in_your_connections

> For my low-end stuff, I have been buying the two sides of mirrors from
> two companies, but I don't know how workable that is for people trying
> to look ``professional''. It's also hard to do with raidz since there
> are so few hard drive brands left.

I agree, and do the same.

> Retailers ought to charge an extra markup for ``aging'' the drives for
> you like cheese, and maintain several color-coded warehouses in which to
> do the aging: ``sell me 10 drives that were aged for six months in the
> Green warehouse.''

I just looked at our field data for disks through last month and would
say that aging won't buy you any assurance. We are seeing excellent and
improving reliability. Mind you, we are selling enterprise-class disks
from the top bins :-)

Meanwhile, thanks Miles for being a setup guy :-)
 -- richard
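A minimal sketch (Python) of the classic MTTDL approximations behind the comparison above, assuming independent failures. The MTBF and MTTR values are illustrative placeholders, not figures from the linked blog posts.

def mttdl_single_parity(n, mtbf, mttr):
    # n disks, survives any single failure (2-way mirror: n = 2)
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl_double_parity(n, mtbf, mttr):
    # n disks, survives any two failures (raidz2; a 3-way mirror is roughly n = 3)
    return mtbf**3 / (n * (n - 1) * (n - 2) * mttr**2)

mtbf = 1_000_000        # assumed per-disk MTBF in hours
mttr = 24               # assumed resilver/repair time in hours
hours_per_year = 24 * 365

for label, val in [
    ("2-way mirror       ", mttdl_single_parity(2, mtbf, mttr)),
    ("3-disk raidz2      ", mttdl_double_parity(3, mtbf, mttr)),
    ("9-disk raidz2 (7+2)", mttdl_double_parity(9, mtbf, mttr)),
]:
    print(f"{label}: ~{val / hours_per_year:.2e} years")

Note that this simple model treats a 3-way mirror and a 3-disk raidz2 identically; the slight edge given to the mirror above comes from reconstruction details the model ignores. It does show the trend described: MTTDL drops as the number of disks in the set grows.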
On Thu, 3 Jul 2008, Richard Elling wrote:
>
> nit: SATA disks are single port, so you would need a SAS
> implementation to get multipathing to the disks. This will not
> significantly impact the overall availability of the data, however.
> I did an availability analysis of Thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Richard,

It seems that the "Thumper" system (with 48 SATA drives) has been pretty
well analyzed now. Is it possible for you to perform similar analysis of
the new Sun Fire X4240 with its 16 SAS drives? SAS drives are usually
faster than SATA drives and it is possible to multipath them (maybe not
in this system?). This system seems ideal for ZFS and should work great
as a medium-sized data server or database server.

Maybe someone can run benchmarks on one and report the results here?

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Chris Cosby wrote:
> I'm going down a bit of a different path with my reply here. I know that all
> shops and their needs for data are different, but hear me out.
>
> 1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's
> insane. Perhaps it's time to look at your backup strategy not from a hardware
> perspective, but from a data retention perspective. Do you really need that
> much data backed up? There has to be some way to get the volume down. If not,
> you're at 100TB in slightly over 4 years (assuming the 25% growth factor). If
> your data is critical, my recommendation is to go find another job and let
> someone else have that headache.

Well, we are talking about backups for ~900 servers that are in production.
Our retention period is 14 days for stuff like web servers, and 3 weeks for
SQL and such.

We could deploy deduplication, but it makes me a wee bit uncomfortable to
blindly trust our backup software.

> 2) 40TB of backups is, at the best possible price, 50 1TB drives (allowing for
> spares and such), or $12,500 for raw drive hardware. Enclosures add some money,
> as do cables and such. For mirroring, 90 1TB drives is $22,500 for the raw
> drives. In my world, I know yours is different, but the difference between a
> $100,000 solution and a $75,000 solution is pretty negligible. The short
> description here: you can afford to do mirrors. Really, you can. Any of the
> parity solutions out there, I don't care what your strategy is, is going to
> cause you more trouble than you're ready to deal with.

Good point. I'll take that into consideration.

> I know these aren't solutions for you; it's just the stuff that was in my head.
> The best possible solution, if you really need this kind of volume, is to
> create something that never has to resilver. Use some nifty combination of
> hardware and ZFS, like a couple of somethings that each export 20TB per
> container as a single volume, and mirror those with ZFS for its end-to-end
> checksumming and ease of management.
>
> That's my considerably more than $0.02

--
Med venlig hilsen / Best Regards

Henrik Johansen
henrik at myunix.dk
Chris Cosby <ccosby+zfs <at> gmail.com> writes:

> You're backing up 40TB+ of data, increasing at 20-25% per year.
> That's insane.

Over time, backing up his data will require _fewer_ and fewer disks.
Disk sizes increase by about 40% every year.

-marc
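A quick sketch (Python) of the arithmetic behind this point, using the growth rates stated in the thread. The 1 TB starting drive size is an assumption for illustration only.

data_tb, drive_tb = 40.0, 1.0
for year in range(6):
    print(f"year {year}: ~{data_tb / drive_tb:5.1f} drives for one full copy")
    data_tb  *= 1.25   # ~25% annual data growth (from the original post)
    drive_tb *= 1.40   # ~40% annual drive-capacity growth (Marc's figure)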