Hi,

I am looking for some best practice advice on a project that I am working on.

We are looking at migrating ~40TB of backup data to ZFS, with an annual data growth of
20-25%.

Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
with one hot spare per 10 drives and just continue to expand that pool as needed.

Between calculating the MTTDL and performance models I was hit by a rather scary thought.

A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
of a vdev would render the entire pool unusable.

This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
vdev should die before the resilvering of at least one disk is complete. Since most disks
will be filled I do expect rather long resilvering times.

We are using 750 GB Seagate (enterprise grade) SATA disks for this project with as much hardware
redundancy as we can get (multiple controllers, dual cabling, I/O multipathing, redundant PSUs,
etc.)

I could use multiple pools but that would make data management harder, which in itself is a lengthy
process in our shop.

The MTTDL figures seem OK so how much do I need to worry? Does anyone have experience with
this kind of setup?

/Don E.
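A back-of-envelope sketch (Python) of the triple-failure scenario described above. The per-disk MTBF and the resilver window are assumed placeholder values rather than figures from the post; substitute your own numbers.

from math import comb

mtbf_hours   = 1_000_000   # assumed per-disk MTBF (typical enterprise SATA spec-sheet figure)
resilver_hrs = 24          # assumed resilver window for a mostly full 750 GB disk
survivors    = 8           # disks left in a 7+2 RAIDZ-2 vdev after the first failure

# P(a given surviving disk also fails inside the resilver window),
# approximating an exponential failure law for small t/MTBF
p_disk = resilver_hrs / mtbf_hours

# P(at least two of the eight survivors fail before the resilver completes),
# keeping only the dominant two-failure term
p_vdev = comb(survivors, 2) * p_disk ** 2

print(f"P(vdev loss during one resilver) ~ {p_vdev:.1e}")

# With N such vdevs in the pool, the exposure per resilver event scales
# roughly as N * p_vdev -- the "weakest vdev" concern grows with pool width.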
Don Enrique wrote:
> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
> with one hot spare per 10 drives and just continue to expand that pool as needed.
>
> Between calculating the MTTDL and performance models I was hit by a rather scary thought.
>
> A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
> of a vdev would render the entire pool unusable.
>
> This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
> vdev should die before the resilvering of at least one disk is complete. Since most disks
> will be filled I do expect rather long resilvering times.

Why are you planning on using RAIDZ-2 rather than mirroring?

--
Darren J Moffat
> Don Enrique wrote:
> > Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
> > with one hot spare per 10 drives and just continue to expand that pool as needed.
> >
> > Between calculating the MTTDL and performance models I was hit by a rather scary thought.
> >
> > A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
> > of a vdev would render the entire pool unusable.
> >
> > This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
> > vdev should die before the resilvering of at least one disk is complete. Since most disks
> > will be filled I do expect rather long resilvering times.
>
> Why are you planning on using RAIDZ-2 rather than mirroring?

Mirroring would increase the cost significantly and is not within the budget of this project.

> --
> Darren J Moffat
On Thu, 3 Jul 2008, Don Enrique wrote:
>
> This means that I potentially could lose 40TB+ of data if three
> disks within the same RAIDZ-2 vdev should die before the resilvering
> of at least one disk is complete. Since most disks will be filled I
> do expect rather long resilvering times.

Yes, this risk always exists. The probability of three disks
independently dying during the resilver is exceedingly low. The chance
that your facility will be hit by an airplane during the resilver is
likely higher. However, it is true that RAIDZ-2 does not offer the
same ease of control over physical redundancy that mirroring does.
If you were to use 10 independent chassis and split each RAIDZ-2
uniformly across the chassis, then the probability of a similar
calamity impacting the same drives would be driven by rack- or
facility-wide factors (e.g. the building burning down) rather than
shelf factors. However, if you had 10 RAID arrays mounted in the same
rack and the rack falls over on its side during a resilver, then hope
is still lost.

I am not seeing any other options for you here. ZFS RAIDZ-2 is about as
good as it gets, and if you want everything in one huge pool, there
will be more risk. Perhaps there is a virtual filesystem layer which
could be used on top of ZFS to emulate a larger filesystem while
refusing to split files across pools.

In the future it would be useful for ZFS to provide the option to not
load-share across huge VDEVs and use VDEV-level space allocators.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Don Enrique wrote:
> Hi,
>
> I am looking for some best practice advice on a project that I am working on.
>
> We are looking at migrating ~40TB of backup data to ZFS, with an annual data growth of
> 20-25%.
>
> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
> with one hot spare per 10 drives and just continue to expand that pool as needed.
>
> Between calculating the MTTDL and performance models I was hit by a rather scary thought.
>
> A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
> of a vdev would render the entire pool unusable.

Yes, but a raidz2 vdev using enterprise-class disks is very reliable.

> This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
> vdev should die before the resilvering of at least one disk is complete. Since most disks
> will be filled I do expect rather long resilvering times.
>
> We are using 750 GB Seagate (enterprise grade) SATA disks for this project with as much hardware
> redundancy as we can get (multiple controllers, dual cabling, I/O multipathing, redundant PSUs,
> etc.)

nit: SATA disks are single port, so you would need a SAS implementation
to get multipathing to the disks. This will not significantly impact the
overall availability of the data, however. I did an availability analysis
of Thumper to show this.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

> I could use multiple pools but that would make data management harder, which in itself is a lengthy
> process in our shop.
>
> The MTTDL figures seem OK so how much do I need to worry? Does anyone have experience with
> this kind of setup?

I think your design is reasonable. We'd need to know the exact hardware
details to be able to make more specific recommendations.
 -- richard
I'm going down a bit of a different path with my reply here. I know that all
shops and their needs for data are different, but hear me out.

1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's
insane. Perhaps it's time to look at your backup strategy not from a hardware
perspective, but from a data retention perspective. Do you really need that
much data backed up? There has to be some way to get the volume down. If not,
you're at 100TB in slightly over 4 years (assuming the 25% growth factor). If
your data is critical, my recommendation is to go find another job and let
someone else have that headache.

2) 40TB of backups is, at the best possible price, 50 1TB drives (allowing for
spares and such), or $12,500 for raw drive hardware. Enclosures add some money,
as do cables and such. For mirroring, 90 1TB drives is $22,500 for the raw
drives. In my world, I know yours is different, but the difference between a
$100,000 solution and a $75,000 solution is pretty negligible. The short
description here: you can afford to do mirrors. Really, you can. Any of the
parity solutions out there, I don't care what your strategy is, is going to
cause you more trouble than you're ready to deal with.

I know these aren't solutions for you; it's just the stuff that was in my head.
The best possible solution, if you really need this kind of volume, is to
create something that never has to resilver. Use some nifty combination of
hardware and ZFS, like a couple of somethings that each export 20TB per
container as a single volume, and mirror those with ZFS for its end-to-end
checksumming and ease of management.

That's my considerably more than $0.02

On Thu, Jul 3, 2008 at 11:56 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:

> On Thu, 3 Jul 2008, Don Enrique wrote:
> >
> > This means that I potentially could lose 40TB+ of data if three
> > disks within the same RAIDZ-2 vdev should die before the resilvering
> > of at least one disk is complete. Since most disks will be filled I
> > do expect rather long resilvering times.
>
> Yes, this risk always exists. The probability of three disks
> independently dying during the resilver is exceedingly low. The chance
> that your facility will be hit by an airplane during the resilver is
> likely higher. However, it is true that RAIDZ-2 does not offer the
> same ease of control over physical redundancy that mirroring does.
> If you were to use 10 independent chassis and split each RAIDZ-2
> uniformly across the chassis, then the probability of a similar
> calamity impacting the same drives would be driven by rack- or
> facility-wide factors (e.g. the building burning down) rather than
> shelf factors. However, if you had 10 RAID arrays mounted in the same
> rack and the rack falls over on its side during a resilver, then hope
> is still lost.
>
> I am not seeing any other options for you here. ZFS RAIDZ-2 is about as
> good as it gets, and if you want everything in one huge pool, there
> will be more risk. Perhaps there is a virtual filesystem layer which
> could be used on top of ZFS to emulate a larger filesystem while
> refusing to split files across pools.
>
> In the future it would be useful for ZFS to provide the option to not
> load-share across huge VDEVs and use VDEV-level space allocators.
>
> Bob

--
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes
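A rough reconstruction (Python) of the drive-count arithmetic in the message above. The $250-per-drive price follows from the totals quoted there; the 7+2 RAIDZ-2 grouping and the one-spare-per-10-drives ratio are taken from the original post, so the totals come out slightly different from the round numbers quoted.

import math

usable_tb       = 40
drive_tb        = 1
price_per_drive = 250   # implied by 50 drives ~ $12,500

# RAIDZ-2 (7 data + 2 parity per vdev), plus one hot spare per 10 drives
vdevs       = math.ceil(usable_tb / (7 * drive_tb))   # 6 vdevs
raidz_disks = vdevs * 9
raidz_disks += math.ceil(raidz_disks / 10)            # hot spares

# 2-way mirrors, plus the same spare ratio
mirror_disks = 2 * math.ceil(usable_tb / drive_tb)
mirror_disks += math.ceil(mirror_disks / 10)

print(f"raidz2 : {raidz_disks:3d} drives, ~${raidz_disks * price_per_drive:,}")
print(f"mirror : {mirror_disks:3d} drives, ~${mirror_disks * price_per_drive:,}")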
>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes: >>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:djm> Why are you planning on using RAIDZ-2 rather than mirroring ? isn''t MTDL sometimes shorter for mirroring than raidz2? I think that is the biggest point of raidz2, is it not? bf> The probability of three disks independently dying during the bf> resilver The thing I never liked about MTDL models is their assuming disk failures are independent events. It seems likely to get a bad batch of disks if you buy a single model from a single manufacturer, and buy all the disks at the same time. They may have consecutive serial numbers, ship in the same box, u.s.w. You can design around marginal power supplies that feed a bank of disks with excessive ripple voltage, cause them all to write marginally readable data, and later make you think the disks all went bad at once. or use long fibre cables to put chassis in different rooms with separate aircon. or tell yourself other strange disaster stories and design around them. But fixing the lack of diversity in manufacturing and shipping seems hard. For my low-end stuff, I have been buying the two sides of mirrors from two companies, but I don''t know how workable that is for people trying to look ``professional''''. It''s also hard to do with raidz since there are so few hard drive brands left. Retailers ought to charge an extra markup for ``aging'''' the drives for you like cheese, and maintian several color-coded warehouses in which to do the aging: ``sell me 10 drives that were aged for six months in the Green warehouse.'''' -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080703/82117c8b/attachment.bin>
Richard Elling wrote:
> Don Enrique wrote:
>> Hi,
>>
>> I am looking for some best practice advice on a project that I am working on.
>>
>> We are looking at migrating ~40TB of backup data to ZFS, with an annual data growth of
>> 20-25%.
>>
>> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs (7+2)
>> with one hot spare per 10 drives and just continue to expand that pool as needed.
>>
>> Between calculating the MTTDL and performance models I was hit by a rather scary thought.
>>
>> A pool comprised of X vdevs is no more resilient to data loss than the weakest vdev, since loss
>> of a vdev would render the entire pool unusable.
>
> Yes, but a raidz2 vdev using enterprise-class disks is very reliable.

That's nice to hear.

>> This means that I potentially could lose 40TB+ of data if three disks within the same RAIDZ-2
>> vdev should die before the resilvering of at least one disk is complete. Since most disks
>> will be filled I do expect rather long resilvering times.
>>
>> We are using 750 GB Seagate (enterprise grade) SATA disks for this project with as much hardware
>> redundancy as we can get (multiple controllers, dual cabling, I/O multipathing, redundant PSUs,
>> etc.)
>
> nit: SATA disks are single port, so you would need a SAS implementation
> to get multipathing to the disks. This will not significantly impact the
> overall availability of the data, however. I did an availability analysis
> of Thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Yeah, I read your blog. Very informative indeed. I am using SAS HBA cards
and SAS enclosures with SATA disks, so I should be fine.

>> I could use multiple pools but that would make data management harder, which in itself is a lengthy
>> process in our shop.
>>
>> The MTTDL figures seem OK so how much do I need to worry? Does anyone have experience with
>> this kind of setup?
>
> I think your design is reasonable. We'd need to know the exact hardware
> details to be able to make more specific recommendations.
>  -- richard

Well, my choice of hardware is kind of limited by two things:

1. We are a 100% Dell shop.
2. We already have lots of enclosures that I would like to reuse for this project.

The HBA cards are SAS 5/E (LSI SAS1068 chipset) cards, and the enclosures
are Dell MD1000 disk arrays.

--
Med venlig hilsen / Best Regards

Henrik Johansen
henrik at myunix.dk
Miles Nordin wrote:
>>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes:
>>>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>
> djm> Why are you planning on using RAIDZ-2 rather than mirroring ?
>
> Isn't MTTDL sometimes shorter for mirroring than raidz2? I think that
> is the biggest point of raidz2, is it not?

Yes. For some MTTDL models, a 3-way mirror is roughly equivalent to a
3-disk raidz2 set, with the mirror being slightly better because you do
not require both of the other two disks to be functional during
reconstruction. As the number of disks in the set increases, the MTTDL
goes down, so a 4-disk raidz2 will have a lower MTTDL than a 3-disk
mirror. Somewhere I have graphs which show this...
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

> bf> The probability of three disks independently dying during the
> bf> resilver
>
> The thing I never liked about MTTDL models is their assumption that disk
> failures are independent events. It seems likely that you'll get a bad
> batch of disks if you buy a single model from a single manufacturer and
> buy all the disks at the same time. They may have consecutive serial
> numbers, ship in the same box, and so on.

You are correct in that the models assume independent failures. Common
failures for "independent" devices (e.g. vintages) can be modeled using
an adjusted MTBF. For example, we sometimes see a vintage where the MTBF
is statistically significantly different from other vintages. These can
be difficult to predict, and any such predictions may not help you make
decisions. Somewhere I talk about that...
http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent

> You can design around marginal power supplies that feed a bank of disks
> with excessive ripple voltage, cause them all to write marginally
> readable data, and later make you think the disks all went bad at once.
> Or use long fibre cables to put chassis in different rooms with separate
> aircon. Or tell yourself other strange disaster stories and design
> around them. But fixing the lack of diversity in manufacturing and
> shipping seems hard.

My favorite is the guy who zip-ties the fiber in a tight wad at the back
of the rack. Fiber (and copper) cables have a minimum bend radius
specification. In fiber cables, small cracks can occur which, over time,
become larger and cause attenuation. If you are really interested in
diversity, you need to copy the data someplace far, far away, as many of
the Katrina survivors learned. But even that might not be enough
diversity...
http://blogs.sun.com/relling/entry/diversity_revisited
http://blogs.sun.com/relling/entry/diversity_in_your_connections

> For my low-end stuff, I have been buying the two sides of mirrors from
> two companies, but I don't know how workable that is for people trying
> to look ``professional''. It's also hard to do with raidz since there
> are so few hard drive brands left.

I agree, and do the same.

> Retailers ought to charge an extra markup for ``aging'' the drives for
> you like cheese, and maintain several color-coded warehouses in which to
> do the aging: ``sell me 10 drives that were aged for six months in the
> Green warehouse.''

I just looked at our field data for disks through last month and would
say that aging won't buy you any assurance. We are seeing excellent and
improving reliability. Mind you, we are selling enterprise-class disks
from the top bins :-)

Meanwhile, thanks Miles for being a setup guy :-)
 -- richard
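A minimal sketch (Python) of the classic MTTDL approximations behind the comparison above, assuming independent failures. The MTBF and MTTR values are illustrative placeholders, not figures from the linked blog posts.

def mttdl_single_parity(n, mtbf, mttr):
    # n disks, survives any single failure (2-way mirror: n = 2)
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl_double_parity(n, mtbf, mttr):
    # n disks, survives any two failures (raidz2; a 3-way mirror is roughly n = 3)
    return mtbf**3 / (n * (n - 1) * (n - 2) * mttr**2)

mtbf = 1_000_000        # assumed per-disk MTBF in hours
mttr = 24               # assumed resilver/repair time in hours
hours_per_year = 24 * 365

for label, val in [
    ("2-way mirror       ", mttdl_single_parity(2, mtbf, mttr)),
    ("3-disk raidz2      ", mttdl_double_parity(3, mtbf, mttr)),
    ("9-disk raidz2 (7+2)", mttdl_double_parity(9, mtbf, mttr)),
]:
    print(f"{label}: ~{val / hours_per_year:.2e} years")

Note that this simple model treats a 3-way mirror and a 3-disk raidz2 identically; the slight edge given to the mirror above comes from reconstruction details the model ignores. It does show the trend described: MTTDL drops as the number of disks in the set grows.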
On Thu, 3 Jul 2008, Richard Elling wrote:
>
> nit: SATA disks are single port, so you would need a SAS
> implementation to get multipathing to the disks. This will not
> significantly impact the overall availability of the data, however.
> I did an availability analysis of Thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Richard,

It seems that the "Thumper" system (with 48 SATA drives) has been pretty
well analyzed now. Is it possible for you to perform similar analysis of
the new Sun Fire X4240 with its 16 SAS drives? SAS drives are usually
faster than SATA drives and it is possible to multipath them (maybe not
in this system?). This system seems ideal for ZFS and should work great
as a medium-sized data server or database server.

Maybe someone can run benchmarks on one and report the results here?

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Chris Cosby wrote:
> I'm going down a bit of a different path with my reply here. I know that all
> shops and their needs for data are different, but hear me out.
>
> 1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's
> insane. Perhaps it's time to look at your backup strategy not from a hardware
> perspective, but from a data retention perspective. Do you really need that
> much data backed up? There has to be some way to get the volume down. If not,
> you're at 100TB in slightly over 4 years (assuming the 25% growth factor). If
> your data is critical, my recommendation is to go find another job and let
> someone else have that headache.

Well, we are talking about backups for ~900 servers that are in production.
Our retention period is 14 days for stuff like web servers, and 3 weeks for
SQL and such.

We could deploy deduplication, but it makes me a wee bit uncomfortable to
blindly trust our backup software.

> 2) 40TB of backups is, at the best possible price, 50 1TB drives (allowing for
> spares and such), or $12,500 for raw drive hardware. Enclosures add some money,
> as do cables and such. For mirroring, 90 1TB drives is $22,500 for the raw
> drives. In my world, I know yours is different, but the difference between a
> $100,000 solution and a $75,000 solution is pretty negligible. The short
> description here: you can afford to do mirrors. Really, you can. Any of the
> parity solutions out there, I don't care what your strategy is, is going to
> cause you more trouble than you're ready to deal with.

Good point. I'll take that into consideration.

> I know these aren't solutions for you; it's just the stuff that was in my head.
> The best possible solution, if you really need this kind of volume, is to
> create something that never has to resilver. Use some nifty combination of
> hardware and ZFS, like a couple of somethings that each export 20TB per
> container as a single volume, and mirror those with ZFS for its end-to-end
> checksumming and ease of management.
>
> That's my considerably more than $0.02

--
Med venlig hilsen / Best Regards

Henrik Johansen
henrik at myunix.dk
Chris Cosby <ccosby+zfs <at> gmail.com> writes:

> You're backing up 40TB+ of data, increasing at 20-25% per year.
> That's insane.

Over time, backing up his data will require _fewer_ and fewer disks.
Disk sizes increase by about 40% every year.

-marc
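A quick sketch (Python) of the arithmetic behind this point, using the growth rates stated in the thread. The 1 TB starting drive size is an assumption for illustration only.

data_tb, drive_tb = 40.0, 1.0
for year in range(6):
    print(f"year {year}: ~{data_tb / drive_tb:5.1f} drives for one full copy")
    data_tb  *= 1.25   # ~25% annual data growth (from the original post)
    drive_tb *= 1.40   # ~40% annual drive-capacity growth (Marc's figure)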