Hi,

We are a company that wants to replace our current storage layout with one that uses ZFS. We have been testing it for a month now, and everything looks promising. One element that we cannot determine is the optimum number of disks in a raid-z pool. The ZFS best practice guide recommends 7, 9, or 11 disks in a single raid-z2. On the other hand, another user says that the most important factor is how the default 128 KiB record size is distributed across the disks, so the recommended layouts would be:

4-disk RAID-Z2 = 128 KiB / 2 = 64 KiB = good
5-disk RAID-Z2 = 128 KiB / 3 = ~43 KiB = not good
6-disk RAID-Z2 = 128 KiB / 4 = 32 KiB = good
10-disk RAID-Z2 = 128 KiB / 8 = 16 KiB = good

What are your recommendations regarding the number of disks? We are planning to use 2 raid-z2 pools with 8+2 disks, 2 spares, 2 SSDs for L2ARC, 2 SSDs for ZIL, 2 disks for the syspool, and a similar machine for replication.

Thanks in advance,
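For reference, the arithmetic behind that list is just the record size divided by the number of data disks (the vdev width minus two parity disks for raid-z2), with "good" meaning the per-disk share is a whole power-of-two number of KiB. A minimal sketch of that rule of thumb only, not of how ZFS actually allocates blocks:

# Illustrative arithmetic only: per-disk share if a full 128 KiB record
# is split evenly across the data disks of a raid-z2 vdev.
RECORDSIZE_KIB = 128
PARITY = 2  # raid-z2

for total_disks in (4, 5, 6, 10):
    data_disks = total_disks - PARITY
    per_disk = RECORDSIZE_KIB / data_disks
    # "good" = a whole power-of-two number of KiB per data disk.
    good = per_disk.is_integer() and (int(per_disk) & (int(per_disk) - 1)) == 0
    print(f"{total_disks}-disk RAID-Z2: {per_disk:.2f} KiB per data disk "
          f"({'good' if good else 'not good'})")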
On 11/28/2010 1:51 PM, Paul Piscuc wrote:
> In the ZFS best practice guide, 7, 9 and 11 disks are recommended to be
> used in a single raid-z2. On the other hand, another user specifies that
> the most important part is the distribution of the default 128 KiB record
> size to all the disks.
> [...]
> What are your recommendations regarding the number of disks? We are
> planning to use 2 raid-z2 pools with 8+2 disks, 2 spares, 2 SSDs for
> L2ARC, 2 SSDs for ZIL, 2 for syspool, and a similar machine for
> replication.

You've hit on one of the hardest parts of using ZFS - optimization. Truth of the matter is that there is NO one-size-fits-all "best" solution. It heavily depends on your workload type - access patterns, write patterns, type of I/O, and size of the average I/O request.

A couple of things here:

(1) Unless you are using Zvols for "raw" disk partitions (for use with something like a database), the recordsize value is a MAXIMUM value, NOT an absolute value. Thus, if you have a ZFS filesystem with a record size of 128k, it will break up I/O into 128k chunks for writing, but it will also write smaller chunks. I forget what the minimum size is (512b or 1k, IIRC), but what ZFS does is use a variable block size, up to the maximum specified in the "recordsize" property. So, if recordsize=128k and you have a 190k write I/O op, it will write a 128k chunk and a 64k chunk (64 being the smallest power of two greater than the remaining 62k of info). It WON'T write two 128k chunks.

(2) #1 comes up a bit when you have a mix of file sizes - for instance, home directories, where you have lots of small files (initialization files, source code, etc.) combined with some much larger files (images, mp3s, executable binaries, etc.). Thus, such a filesystem will have a wide variety of chunk sizes, which makes optimization difficult, to say the least.

(3) For *random* I/O, a raidZ of any number of disks performs roughly like a *single* disk in terms of IOPS, and a little better than a single disk in terms of throughput. So, if you have considerable amounts of random I/O, you should really either use small raidz configs (no more than 4 data disks) or switch to mirrors instead.

(4) For *sequential* or large-size I/O, a raidZ performs roughly equivalent to a stripe of the same number of data disks. That is, an N-disk raidz2 will perform about the same as an (N-2)-disk stripe in terms of throughput and IOPS.

(5) As I mentioned in #1, *all* ZFS I/O is broken up into powers-of-two-sized chunks, even if the last chunk must have some padding in it to get to a power of two. This has implications as to the best number of disks in a raidZ(n).

I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2. Due to #5 above, best performance comes with an EVEN number of data disks in any raidZ, so a write to any disk is always a full portion of the chunk, rather than a partial one (that sounds funny, but trust me).
The best balance of size, IOPS, and throughput is found in the mid-size raidZ(n) configs, where there are 4, 6, or 8 data disks.

Honestly, even with you describing a workload, it will be hard for us to give you an exact answer. My best suggestion is to do some testing with raidZ(n) configs of different sizes, to see the tradeoffs between size and performance.

Also, in your sample config, unless you plan to use the spare disks for redundancy on the boot mirror, it would be better to configure 2 x 11-disk raidZ3 than 2 x 10-disk raidZ2 + 2 spares. Better reliability.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
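To make point (1) above concrete, here is a minimal sketch of the splitting arithmetic as described there - a toy model of the behaviour Erik describes, not ZFS's actual allocator; split_write is a made-up helper name:

# Split a write into power-of-two-sized chunks no larger than recordsize,
# per the behaviour described above.  Illustrative only.
def split_write(size_kib, recordsize_kib=128):
    chunks = []
    remaining = size_kib
    while remaining > 0:
        if remaining >= recordsize_kib:
            chunk = recordsize_kib
        else:
            # Smallest power of two that covers what is left
            # (so the tail chunk may contain some padding).
            chunk = 1
            while chunk < remaining:
                chunk *= 2
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# The 190k example: one 128 KiB chunk plus one 64 KiB chunk.
print(split_write(190))   # [128, 64]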
Hi,

Thanks for the quick reply. Now that you have mentioned it, we have a different question: what is the advantage of using spare disks instead of including them in the raid-z array? If the system pool is on mirrored disks, I think that this would be enough (hopefully). When one disk fails, isn't it better to have a spare disk on hold, instead of one more disk in the raid-z and no spares (or just a few)? Or, rephrased: is it safer and faster to replace a disk in a raid-z3 and restore the data from the other disks, or to have a raid-z2 with a spare disk?

Thank you,

On Mon, Nov 29, 2010 at 6:03 AM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Also, in your sample config, unless you plan to use the spare disks for
> redundancy on the boot mirror, it would be better to configure 2 x 11-disk
> raidZ3 than 2 x 10-disk raidZ2 + 2 spares. Better reliability.
On 29 November 2010 15:03, Erik Trimble <erik.trimble at oracle.com> wrote:
> I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure
> the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2.
> Due to #5 above, best performance comes with an EVEN number of data disks
> in any raidZ, so a write to any disk is always a full portion of the
> chunk, rather than a partial one (that sounds funny, but trust me). The
> best balance of size, IOPs, and throughput is found in the mid-size
> raidZ(n) configs, where there are 4, 6 or 8 data disks.

Let the maximum block size s = 128 KiB. If the number of disks in a raidz vdev is n, p is the number of parity disks, and d is the number of data disks, then n = d + p.

So, for some given values of d:

 d   s/d (KiB)
 1   128
 2   64
 3   42.67
 4   32
 5   25.6
 6   21.33
 7   18.29
 8   16
 9   14.22
10   12.8

Hence, for a raidz vdev with a width of 7, d = 6 and s/d = 21.33 KiB. This isn't an ideal block size by any stretch of the imagination. Same thing for a width of 11: d = 10, s/d = 12.8 KiB.

What you were aiming for: for ideal performance, one should keep the vdev width to the form 2^x + p. So, for raidz: 2, 3, 5, 9, 17. raidz2: 3, 4, 6, 10, 18, etc.

Cheers,
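A short sketch that enumerates the widths satisfying that 2^x + p rule (ideal_widths is an illustrative helper, not an existing tool):

def ideal_widths(parity, max_exp=4):
    # Widths of the form 2^x + p, so s/d = 128 KiB / 2^x is itself a
    # power-of-two number of KiB per data disk.
    return [2 ** x + parity for x in range(max_exp + 1)]

print("raidz1:", ideal_widths(1))  # [2, 3, 5, 9, 17]
print("raidz2:", ideal_widths(2))  # [3, 4, 6, 10, 18]
print("raidz3:", ideal_widths(3))  # [4, 5, 7, 11, 19]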
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Erik Trimble
>
> (1) Unless you are using Zvols for "raw" disk partitions (for use with
> something like a database), the recordsize value is a MAXIMUM value, NOT
> an absolute value. Thus, if you have a ZFS filesystem with a record size
> of 128k, it will break up I/O into 128k chunks for writing, but it will
> also write smaller chunks. I forget what the minimum size is (512b or 1k,
> IIRC), but what ZFS does is use a variable block size, up to the maximum
> size specified in the "recordsize" property. So, if recordsize=128k and
> you have a 190k write I/O op, it will write a 128k chunk and a 64k chunk
> (64 being the smallest power of two greater than the remaining 62k of
> info). It WON'T write two 128k chunks.

So... suppose there is a raidz2 with 8+2 disks. You write a 128K chunk, which gets divided up into 8 parts, and each disk writes a 16K block, right?

It seems to me that limiting the maximum size of data a disk can write will ultimately result in more random scattering of information about the drives, and degrade performance. We previously calculated (in some other thread) that in order for a drive to be "efficient", which we defined as 99% useful time and 1% time wasted seeking, each disk would need to be reading/writing 40 MB blocks consistently. (Of course, this depends on the specs of the drive, but typical consumer and enterprise disks were consistently around 40 MB.)

So wouldn't it be useful to set the recordsize to something huge? Then if you've got a large chunk of data to be written, it's actually *permitted* to be written as a large chunk instead of being forcibly broken up?
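For what it's worth, that 99%-efficiency figure falls out of comparing seek time to transfer time. A rough sketch of the calculation, using assumed seek times and throughputs (not the numbers from the earlier thread; the exact result depends on the drive):

# How large a sequential transfer does a disk need per seek so that only
# 1% of its time is spent seeking?  All drive figures below are assumptions.
def chunk_for_efficiency(seek_s, throughput_mb_s, efficiency=0.99):
    # efficiency = transfer / (transfer + seek)  =>  transfer = seek * e / (1 - e)
    transfer_s = seek_s * efficiency / (1.0 - efficiency)
    return transfer_s * throughput_mb_s  # MB per seek

for seek_ms, mb_s in [(4, 100), (8, 100)]:
    size = chunk_for_efficiency(seek_ms / 1000.0, mb_s)
    print(f"seek {seek_ms} ms at {mb_s} MB/s -> ~{size:.0f} MB per seek")
# Lands in the tens of MB, the same order as the ~40 MB figure quoted above.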
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Paul Piscuc
>
> looks promising. One element that we cannot determine is the optimum
> number of disks in a raid-z pool. In the ZFS best practice guide, 7, 9
> and 11

There are several important things to consider:
-1- Performance in usage.
-2- Cost to buy disks & slots to hold disks.
-3- Resilver / scrub time.

You're already on the right track to answer #1 and #2, so I want to talk a little bit about #3.

For typical usage on spindle hard disks, ZFS has a problem with resilver and scrub time. It will only resilver or scrub the used areas of disk, which seems like it would be faster than doing the whole disk, but since that ends up being a whole bunch of small blocks scattered about the disk, and typically most of the disk is used, and the order of resilver/scrub is not in disk order, you end up needing to do random seeks all over the disk to read/write nearly the whole disk. The end result is a resilver time that can be 1-2 orders of magnitude longer than you expected: like a week or three if you have a bad configuration (lots of disks in a vdev), or 12-24 hours in the best case (mirrors and nothing else).

The problem is linearly related to the number of used blocks in the degraded vdev, which is itself usually approximated as a fraction of the total pool. So you minimize the problem if you use mirrors, and you maximize the problem if you make your pool from one huge raidzN vdev.

On my disks, for a Sun server where this was an issue for me: if I needed to resilver the entire disk sequentially, including unused space, it would have required 2 hrs. I use ZFS mirrors, and it actually took 12 hrs. If I had made the pool one big raidzN, it would have needed 20 days.

Until this problem is fixed, I recommend using mirrors only and staying away from raidzN, unless you're going to build your whole pool out of SSDs.
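A rough scaling sketch of the above. Every number here (disk size, utilisation, average record size, IOPS, throughput) is an assumption for illustration, not the poster's hardware, and real resilvers depend on layout details this model ignores; the point is only the order-of-magnitude gap between a throughput-limited rebuild and a seek-limited one:

# Resilver time model: sequential rebuild is throughput-limited; a ZFS
# resilver visits every used record in the degraded vdev at roughly one
# disk's worth of random IOPS.  All figures are illustrative assumptions.
DISK_BYTES  = 0.5e12      # 500 GB disk
UTILISATION = 0.7         # fraction of space in use
AVG_RECORD  = 64 * 1024   # average used-record size, bytes
RANDOM_IOPS = 100         # random IOPS one spindle can sustain
SEQ_MB_S    = 70          # sequential throughput, MB/s

# Traditional sequential whole-disk rebuild.
seq_hours = DISK_BYTES / (SEQ_MB_S * 1e6) / 3600

def resilver_hours(data_disks_in_vdev):
    # Used records in the degraded vdev, replayed at single-disk random IOPS.
    records = DISK_BYTES * UTILISATION * data_disks_in_vdev / AVG_RECORD
    return records / RANDOM_IOPS / 3600

print(f"sequential rebuild:      ~{seq_hours:.0f} h")
print(f"mirror vdev resilver:    ~{resilver_hours(1):.0f} h")
print(f"10-wide raidz2 resilver: ~{resilver_hours(8) / 24:.1f} days")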