Hi,

We are a company that wants to replace our current storage layout with one that uses ZFS. We have been testing it for a month now, and everything looks promising. One element that we cannot determine is the optimum number of disks in a raid-z pool. The ZFS best practice guide recommends 7, 9, or 11 disks in a single raid-z2. On the other hand, another user says that the most important factor is how the default 128 KiB record size is distributed across the disks, so the recommended layouts would be:

4-disk RAID-Z2 = 128 KiB / 2 = 64 KiB = good
5-disk RAID-Z2 = 128 KiB / 3 = ~43 KiB = not good
6-disk RAID-Z2 = 128 KiB / 4 = 32 KiB = good
10-disk RAID-Z2 = 128 KiB / 8 = 16 KiB = good

What are your recommendations regarding the number of disks? We are planning to use 2 raid-z2 pools with 8+2 disks, 2 spares, 2 SSDs for L2ARC, 2 SSDs for ZIL, 2 disks for the syspool, and a similar machine for replication.

Thanks in advance,
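For reference, the arithmetic behind that list is just the record size divided by the number of data disks (the vdev width minus two parity disks for raid-z2), with "good" meaning the per-disk share is a whole power-of-two number of KiB. A minimal sketch of that rule of thumb only, not of how ZFS actually allocates blocks:

# Illustrative arithmetic only: per-disk share if a full 128 KiB record
# is split evenly across the data disks of a raid-z2 vdev.
RECORDSIZE_KIB = 128
PARITY = 2  # raid-z2

for total_disks in (4, 5, 6, 10):
    data_disks = total_disks - PARITY
    per_disk = RECORDSIZE_KIB / data_disks
    # "good" = a whole power-of-two number of KiB per data disk.
    good = per_disk.is_integer() and (int(per_disk) & (int(per_disk) - 1)) == 0
    print(f"{total_disks}-disk RAID-Z2: {per_disk:.2f} KiB per data disk "
          f"({'good' if good else 'not good'})")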
On 11/28/2010 1:51 PM, Paul Piscuc wrote:
> In the ZFS best practice guide, 7, 9 and 11 disks are recommended to be
> used in a single raid-z2. On the other hand, another user specifies that
> the most important part is the distribution of the default 128 KiB record
> size to all the disks.
> [...]
> What are your recommendations regarding the number of disks? We are
> planning to use 2 raid-z2 pools with 8+2 disks, 2 spares, 2 SSDs for
> L2ARC, 2 SSDs for ZIL, 2 for syspool, and a similar machine for
> replication.

You've hit on one of the hardest parts of using ZFS - optimization. Truth of the matter is that there is NO one-size-fits-all "best" solution. It heavily depends on your workload type - access patterns, write patterns, type of I/O, and size of the average I/O request.

A couple of things here:

(1) Unless you are using Zvols for "raw" disk partitions (for use with something like a database), the recordsize value is a MAXIMUM value, NOT an absolute value. Thus, if you have a ZFS filesystem with a record size of 128k, it will break up I/O into 128k chunks for writing, but it will also write smaller chunks. I forget what the minimum size is (512b or 1k, IIRC), but what ZFS does is use a variable block size, up to the maximum specified in the "recordsize" property. So, if recordsize=128k and you have a 190k write I/O op, it will write a 128k chunk and a 64k chunk (64 being the smallest power of two greater than the remaining 62k of info). It WON'T write two 128k chunks.

(2) #1 comes up a bit when you have a mix of file sizes - for instance, home directories, where you have lots of small files (initialization files, source code, etc.) combined with some much larger files (images, mp3s, executable binaries, etc.). Thus, such a filesystem will have a wide variety of chunk sizes, which makes optimization difficult, to say the least.

(3) For *random* I/O, a raidZ of any number of disks performs roughly like a *single* disk in terms of IOPS, and a little better than a single disk in terms of throughput. So, if you have considerable amounts of random I/O, you should really either use small raidz configs (no more than 4 data disks) or switch to mirrors instead.

(4) For *sequential* or large-size I/O, a raidZ performs roughly equivalent to a stripe of the same number of data disks. That is, an N-disk raidz2 will perform about the same as an (N-2)-disk stripe in terms of throughput and IOPS.

(5) As I mentioned in #1, *all* ZFS I/O is broken up into powers-of-two-sized chunks, even if the last chunk must have some padding in it to get to a power of two. This has implications as to the best number of disks in a raidZ(n).

I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2. Due to #5 above, best performance comes with an EVEN number of data disks in any raidZ, so a write to any disk is always a full portion of the chunk, rather than a partial one (that sounds funny, but trust me).
The best balance of size, IOPS, and throughput is found in the mid-size raidZ(n) configs, where there are 4, 6, or 8 data disks.

Honestly, even with you describing a workload, it will be hard for us to give you an exact answer. My best suggestion is to do some testing with raidZ(n) configs of different sizes, to see the tradeoffs between size and performance.

Also, in your sample config, unless you plan to use the spare disks for redundancy on the boot mirror, it would be better to configure 2 x 11-disk raidZ3 than 2 x 10-disk raidZ2 + 2 spares. Better reliability.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
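To make point (1) above concrete, here is a minimal sketch of the splitting arithmetic as described there - a toy model of the behaviour Erik describes, not ZFS's actual allocator; split_write is a made-up helper name:

# Split a write into power-of-two-sized chunks no larger than recordsize,
# per the behaviour described above.  Illustrative only.
def split_write(size_kib, recordsize_kib=128):
    chunks = []
    remaining = size_kib
    while remaining > 0:
        if remaining >= recordsize_kib:
            chunk = recordsize_kib
        else:
            # Smallest power of two that covers what is left
            # (so the tail chunk may contain some padding).
            chunk = 1
            while chunk < remaining:
                chunk *= 2
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# The 190k example: one 128 KiB chunk plus one 64 KiB chunk.
print(split_write(190))   # [128, 64]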
Hi,

Thanks for the quick reply. Now that you have mentioned it, we have a different question: what is the advantage of using spare disks instead of including them in the raid-z array? If the system pool is on mirrored disks, I think that this would be enough (hopefully). When one disk fails, isn't it better to have a spare disk on hold, instead of one more disk in the raid-z and no spares (or just a few)? Or, rephrased: is it safer and faster to replace a disk in a raid-z3 and restore the data from the other disks, or to have a raid-z2 with a spare disk?

Thank you,

On Mon, Nov 29, 2010 at 6:03 AM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Also, in your sample config, unless you plan to use the spare disks for
> redundancy on the boot mirror, it would be better to configure 2 x 11-disk
> raidZ3 than 2 x 10-disk raidZ2 + 2 spares. Better reliability.
On 29 November 2010 15:03, Erik Trimble <erik.trimble at oracle.com> wrote:
> I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure
> the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2.
> Due to #5 above, best performance comes with an EVEN number of data disks
> in any raidZ, so a write to any disk is always a full portion of the
> chunk, rather than a partial one (that sounds funny, but trust me). The
> best balance of size, IOPs, and throughput is found in the mid-size
> raidZ(n) configs, where there are 4, 6 or 8 data disks.

Let the maximum block size s = 128 KiB. If the number of disks in a raidz vdev is n, p is the number of parity disks, and d is the number of data disks, then n = d + p.

So, for some given values of d:

 d   s/d (KiB)
 1   128
 2   64
 3   42.67
 4   32
 5   25.6
 6   21.33
 7   18.29
 8   16
 9   14.22
10   12.8

Hence, for a raidz vdev with a width of 7, d = 6 and s/d = 21.33 KiB. This isn't an ideal block size by any stretch of the imagination. Same thing for a width of 11: d = 10, s/d = 12.8 KiB.

What you were aiming for: for ideal performance, one should keep the vdev width to the form 2^x + p. So, for raidz: 2, 3, 5, 9, 17. raidz2: 3, 4, 6, 10, 18, etc.

Cheers,
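A short sketch that enumerates the widths satisfying that 2^x + p rule (ideal_widths is an illustrative helper, not an existing tool):

def ideal_widths(parity, max_exp=4):
    # Widths of the form 2^x + p, so s/d = 128 KiB / 2^x is itself a
    # power-of-two number of KiB per data disk.
    return [2 ** x + parity for x in range(max_exp + 1)]

print("raidz1:", ideal_widths(1))  # [2, 3, 5, 9, 17]
print("raidz2:", ideal_widths(2))  # [3, 4, 6, 10, 18]
print("raidz3:", ideal_widths(3))  # [4, 5, 7, 11, 19]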
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Erik Trimble
>
> (1) Unless you are using Zvols for "raw" disk partitions (for use with
> something like a database), the recordsize value is a MAXIMUM value, NOT
> an absolute value. Thus, if you have a ZFS filesystem with a record size
> of 128k, it will break up I/O into 128k chunks for writing, but it will
> also write smaller chunks. I forget what the minimum size is (512b or 1k,
> IIRC), but what ZFS does is use a variable block size, up to the maximum
> size specified in the "recordsize" property. So, if recordsize=128k and
> you have a 190k write I/O op, it will write a 128k chunk and a 64k chunk
> (64 being the smallest power of two greater than the remaining 62k of
> info). It WON'T write two 128k chunks.

So... suppose there is a raidz2 with 8+2 disks. You write a 128K chunk, which gets divided up into 8 parts, and each disk writes a 16K block, right?

It seems to me that limiting the maximum size of data a disk can write will ultimately result in more random scattering of information about the drives, and degrade performance. We previously calculated (in some other thread) that in order for a drive to be "efficient", which we defined as 99% useful time and 1% time wasted seeking, each disk would need to be reading/writing 40 MB blocks consistently. (Of course, this depends on the specs of the drive, but typical consumer and enterprise disks were consistently around 40 MB.)

So wouldn't it be useful to set the recordsize to something huge? Then if you've got a large chunk of data to be written, it's actually *permitted* to be written as a large chunk instead of being forcibly broken up?
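For what it's worth, that 99%-efficiency figure falls out of comparing seek time to transfer time. A rough sketch of the calculation, using assumed seek times and throughputs (not the numbers from the earlier thread; the exact result depends on the drive):

# How large a sequential transfer does a disk need per seek so that only
# 1% of its time is spent seeking?  All drive figures below are assumptions.
def chunk_for_efficiency(seek_s, throughput_mb_s, efficiency=0.99):
    # efficiency = transfer / (transfer + seek)  =>  transfer = seek * e / (1 - e)
    transfer_s = seek_s * efficiency / (1.0 - efficiency)
    return transfer_s * throughput_mb_s  # MB per seek

for seek_ms, mb_s in [(4, 100), (8, 100)]:
    size = chunk_for_efficiency(seek_ms / 1000.0, mb_s)
    print(f"seek {seek_ms} ms at {mb_s} MB/s -> ~{size:.0f} MB per seek")
# Lands in the tens of MB, the same order as the ~40 MB figure quoted above.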
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Paul Piscuc
>
> looks promising. One element that we cannot determine is the optimum
> number of disks in a raid-z pool. In the ZFS best practice guide, 7, 9
> and 11

There are several important things to consider:
-1- Performance in usage.
-2- Cost to buy disks & slots to hold disks.
-3- Resilver / scrub time.

You're already on the right track to answer #1 and #2, so I want to talk a little bit about #3.

For typical usage on spindle hard disks, ZFS has a problem with resilver and scrub time. It will only resilver or scrub the used areas of disk, which seems like it would be faster than doing the whole disk, but since that ends up being a whole bunch of small blocks scattered about the disk, and typically most of the disk is used, and the order of resilver/scrub is not in disk order, you end up needing to do random seeks all over the disk to read/write nearly the whole disk. The end result is a resilver time that can be 1-2 orders of magnitude longer than you expected: like a week or three if you have a bad configuration (lots of disks in a vdev), or 12-24 hours in the best case (mirrors and nothing else).

The problem is linearly related to the number of used blocks in the degraded vdev, which is itself usually approximated as a fraction of the total pool. So you minimize the problem if you use mirrors, and you maximize the problem if you make your pool from one huge raidzN vdev.

On my disks, for a Sun server where this was an issue for me: if I needed to resilver the entire disk sequentially, including unused space, it would have required 2 hrs. I use ZFS mirrors, and it actually took 12 hrs. If I had made the pool one big raidzN, it would have needed 20 days.

Until this problem is fixed, I recommend using mirrors only and staying away from raidzN, unless you're going to build your whole pool out of SSDs.
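A rough scaling sketch of the above. Every number here (disk size, utilisation, average record size, IOPS, throughput) is an assumption for illustration, not the poster's hardware, and real resilvers depend on layout details this model ignores; the point is only the order-of-magnitude gap between a throughput-limited rebuild and a seek-limited one:

# Resilver time model: sequential rebuild is throughput-limited; a ZFS
# resilver visits every used record in the degraded vdev at roughly one
# disk's worth of random IOPS.  All figures are illustrative assumptions.
DISK_BYTES  = 0.5e12      # 500 GB disk
UTILISATION = 0.7         # fraction of space in use
AVG_RECORD  = 64 * 1024   # average used-record size, bytes
RANDOM_IOPS = 100         # random IOPS one spindle can sustain
SEQ_MB_S    = 70          # sequential throughput, MB/s

# Traditional sequential whole-disk rebuild.
seq_hours = DISK_BYTES / (SEQ_MB_S * 1e6) / 3600

def resilver_hours(data_disks_in_vdev):
    # Used records in the degraded vdev, replayed at single-disk random IOPS.
    records = DISK_BYTES * UTILISATION * data_disks_in_vdev / AVG_RECORD
    return records / RANDOM_IOPS / 3600

print(f"sequential rebuild:      ~{seq_hours:.0f} h")
print(f"mirror vdev resilver:    ~{resilver_hours(1):.0f} h")
print(f"10-wide raidz2 resilver: ~{resilver_hours(8) / 24:.1f} days")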