I know there was a thread about this a few months ago. However, with the costs of SSDs falling like they have, the idea of an Oracle X4270 M2/Cisco C210 M2/IBM x3650 M3 class of machine with a 13-drive RAIDZ2 zpool (1 hot spare) is really starting to sound alluring to me/us. Especially with something like the OCZ Deneva 2 drives (SandForce 2281 with a supercap), the SanDisk (Pliant) Lightning series, or perhaps the Hitachi SSD400Ms coming in at prices that aren't a whole lot more than 600GB 15k drives. (From an enterprise perspective, anyway.)

Systems with a similar load (OLTP) are frequently I/O bound - e.g. a server with a Sun 2540 FC array with 11x 300GB 15k SAS drives and 2x X25-Es for ZIL/L2ARC - so the extra bandwidth would be welcome.

Am I crazy for putting something like this into production using Solaris 10/11? On paper, it really seems ideal for our needs.

Also, maybe I read it wrong, but why is it that (in the previous thread about hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL, but it's so common in the NetApp world that I was surprised to read that. A 20-drive RAID-Z2 pool really wouldn't/couldn't recover (resilver) from a drive failure? That seems to fly in the face of the x4500 boxes from a few years ago.

matt
On Tue, 27 Sep 2011, Matt Banks wrote:

> Am I crazy for putting something like this into production using Solaris 10/11?
> On paper, it really seems ideal for our needs.

As long as the drive firmware operates correctly, I don't see a problem.

> Also, maybe I read it wrong, but why is it that (in the previous
> thread about hw raid and zpools) zpools with large numbers of
> physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL

There is no concern with a large number of physical drives in a pool. The primary concern is with the number of drives per vdev. Any variation in the latency of the drives hinders performance, and each I/O to a vdev consumes one "IOP" across all of the drives in the vdev (or stripe) when raidzN is used. Having more vdevs is better for consistent performance and more available IOPS.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
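(For readers following along: per-vdev behaviour like this is easy to watch on a live system. A minimal sketch; the pool name "tank" is just a placeholder for whatever your pool is called:)

    # Show I/O statistics broken out per vdev (and per leaf device),
    # refreshing every 5 seconds; compare operations/sec across vdevs.
    zpool iostat -v tank 5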
On Tue, Sep 27, 2011 at 1:21 PM, Matt Banks <mattbanks at gmail.com> wrote:

> Also, maybe I read it wrong, but why is it that (in the previous thread about
> hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+)
> were frowned upon? I know that ZFS != WAFL but it's so common in the
> NetApp world that I was surprised to read that. A 20-drive RAID-Z2 pool
> really wouldn't/couldn't recover (resilver) from a drive failure? That seems
> to fly in the face of the x4500 boxes from a few years ago.

There is a world of difference between a zpool with 20+ drives and a single vdev with 20+ drives. What has been frowned upon is a single vdev with more than about 8 drives.

I have a zpool with 120 drives: 22 vdevs, each a 5-drive raidz2, plus 10 hot spares. The only failures I had to resilver were before it went into production (and I had little data in it at the time), but I expect resilver times to be reasonable based on experience with other configurations I have had.

Keep in mind that random read I/O is proportional to the number of vdevs, NOT the number of drives. See
https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&output=html
for the results of some of my testing.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical (http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
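(To make the vdev-versus-pool distinction concrete for the 13-drive box in the original post, here is a rough sketch of the two layouts being compared. Device names are hypothetical, and it assumes the "13-drive RAIDZ2 (1 hot spare)" means a 12-wide raidz2 plus a spare; you would of course pick one layout or the other:)

    # Layout 1: what the original post describes -- a single 12-disk raidz2
    # plus a hot spare.  One vdev, so one vdev's worth of random IOPS.
    zpool create bigtank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
               c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
        spare  c1t12d0

    # Layout 2: the same 13 disks as two 6-disk raidz2 vdevs plus the spare.
    # Twice the vdevs, so roughly twice the random-read IOPS and much less
    # data per vdev to resilver, at the cost of two more disks of parity.
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
        spare  c1t12d0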
On 9/27/2011 10:39 AM, Bob Friesenhahn wrote:

> On Tue, 27 Sep 2011, Matt Banks wrote:
>
>> Also, maybe I read it wrong, but why is it that (in the previous
>> thread about hw raid and zpools) zpools with large numbers of
>> physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL
>
> There is no concern with a large number of physical drives in a pool.
> The primary concern is with the number of drives per vdev. Any
> variation in the latency of the drives hinders performance, and each
> I/O to a vdev consumes one "IOP" across all of the drives in the vdev
> (or stripe) when raidzN is used. Having more vdevs is better for
> consistent performance and more available IOPS.
>
> Bob

To expound just a bit on Bob's reply: the reason that large numbers of disks in a RAIDZ* vdev are frowned upon is that IOPS for a RAIDZ vdev are pretty much O(1), regardless of how many disks are in the vdev. So the IOPS throughput of a 20-disk vdev is the same as that of a 5-disk vdev. Streaming throughput is significantly higher (it scales as O(N)), but you're unlikely to get that for the vast majority of workloads.

Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the situation where the time to resilver X amount of data on a 5-drive RAIDZ is the same as on a 30-drive RAIDZ. Since you're highly likely to store much more data on a larger vdev, the resilver time to replace a drive goes up linearly with the number of drives in a RAIDZ vdev.

This leads to the following situation: with 20 x 1TB drives, here are several possible configurations and their relative resilver times (relative, because without knowing the exact layout of the data itself, I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ:  15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ:  16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the resilver time for the same amount of data in the zpool. The above also applies to RAIDZ[23], as the additional parity disk doesn't materially affect resilver times in either direction (and yes, it's not really a "parity disk" - I'm just being sloppy).

The other main reason is that a larger number of drives in a single vdev means a higher probability that multiple disk failures will result in loss of data. Richard Elling had some data on the exact calculations, but it boils down to the fact that your chance of total data loss from multiple drive failures goes up MORE THAN LINEARLY as you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ has well over 2x the chance of failure that a 2 x 5-disk RAIDZ zpool has.

-Erik
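(A quick back-of-the-envelope check of the usable-capacity column above; a rough sketch only, assuming single-parity raidz, 1TB drives, and ignoring spares and metadata overhead:)

    # For 20 x 1TB drives split into equal-width single-parity raidz vdevs,
    # usable space is (width - 1) TB per vdev.
    for width in 4 5 10 20; do
        vdevs=$(( 20 / width ))
        usable=$(( vdevs * (width - 1) ))
        echo "${vdevs} x ${width}-disk raidz: ${usable} TB usable"
    done
    # Prints 15, 16, 18 and 19 TB, matching layouts (a) through (d).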
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Matt Banks
>
> Am I crazy for putting something like this into production using Solaris 10/11?
> On paper, it really seems ideal for our needs.

Do you have an objection to Solaris 10/11 for some reason? No, it's not crazy (and I wonder why you would ask).

> Also, maybe I read it wrong, but why is it that (in the previous thread about
> hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+)

Clarification that I know others have already added, but I'll reiterate: it's not the number of devices in a zpool that matters. It's the amount of data in the resilvering vdev, the number of devices inside the vdev, and your usage patterns (where, especially for a database server, the typical usage pattern is the worst-case usage pattern). Together these of course have a relation to the number of devices in the pool, but that's not what matters.

The problem basically applies to HDDs. By building your pool out of SSDs, this problem should be eliminated.

Here is the problem: assuming the data in the pool is evenly distributed amongst the vdevs, the more vdevs you have, the less data is in each one. If you make your pool out of a small number of large raidzN vdevs, then you're going to have relatively a lot of data in each vdev, and therefore a lot of data in the resilvering vdev.

When a vdev resilvers, it reads each slab of data in essentially time order, which is approximately random disk order, in order to reconstruct the data that must be written to the resilvering device. This creates two problems: (a) since each disk must fetch a piece of each slab, the random access time of the vdev as a whole is approximately the random access time of the slowest individual device, so the more devices in the vdev, the worse the IOPS for the vdev; and (b) the more data slabs in the vdev, the more iterations of random I/O operations must be completed.

In other words, during resilvers you're IOPS limited. If your pool is made of all SSDs, then problem (a) is basically nonexistent, since the random access times of all the devices are equal and essentially zero. Problem (b) isn't necessarily a problem... It's like, if somebody is giving you $1,000 for free every month and then they suddenly drop down to only $500, you complain about what you've lost. ;-) (See below.)

In a hardware RAID system, resilvering is done sequentially on all disks in the array. Depending on your specs, a typical time might be 2 hrs. All blocks are resilvered regardless of whether or not they're used. But in ZFS, only used blocks are resilvered. That means if your vdev is empty, your resilver completes instantly. Also, if your vdev is made of SSDs, then the random access times will be just like the sequential access times, and your worst case is still equal to the hardware RAID resilver time.

The only time there's a problem is when you have a vdev made of HDDs, there's a bunch of data in it, and it's scattered randomly (which typically happens due to the nature of COW and snapshot creation/deletion over time). So the HDDs thrash around spending all their time doing random access, with very little payload for each random op. In these cases, even HDD mirrors end up having resilver times that are several times longer than sequentially resilvering the whole disk, including unused blocks.
In this case, mirrors are the best-case scenario, because they have both (a) minimal data in each vdev, and (b) a minimal number of devices in the resilvering vdev. Even so, the mirror resilver time might be something like 12 hours, in my experience, instead of the 2 hrs that hardware would have needed to resilver the whole disk. But if you were using a big raidzN vdev made of a bunch of HDDs (let's say 21 disks in a raidz3), you might get resilver times that are a couple of orders of magnitude too long... like 20 days instead of 10 hours. At that level, you should assume your resilver will never complete.

So again: not a problem if you're making your pool out of SSDs.
On Wed, Sep 28, 2011 at 8:21 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> When a vdev resilvers, it will read each slab of data, in essentially time
> order, which is approximately random disk order, in order to reconstruct the
> data that must be written on the resilvering device. This creates two
> problems, (a) Since each disk must fetch a piece of each slab, the random
> access time of the vdev as a whole is approximately the random access time
> of the slowest individual device. So the more devices in the vdev, the
> worse the IOPS for the vdev... And (b) the more data slabs in the vdev, the
> more iterations of random IO operations must be completed.
>
> In other words, during resilvers, you're IOPS limited. If your pool is made
> of all SSDs, then problem (a) is basically nonexistent, since the random
> access time of all the devices are equal and essentially zero. Problem (b)
> isn't necessarily a problem... It's like, if somebody is giving you $1,000
> for free every month and then they suddenly drop down to only $500, you
> complain about what you've lost. ;-) (See below.)

If you regularly spend all of the given $1,000, then you're going to complain hard when it suddenly drops to $500.

> So again: Not a problem if you're making your pool out of SSDs.

Big problem if your system is already using most of the available IOPS during normal operation.

--
Fajar
On Tue, 27 Sep 2011, Edward Ned Harvey wrote:

> The problem basically applies to HDDs. By building your pool out of SSDs,
> this problem should be eliminated.

This is not completely true. SSDs will help significantly, but they will still suffer from the synchronized commit of a transaction group. SSDs don't suffer from seek time, but they do suffer from erase/write time, and many SSDs are capable of only a few thousand flushed writes per second. It is just a matter of degree. SSDs which do garbage collection during the write cycle could cause the whole vdev to temporarily hang until the last SSD has committed its write.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sep 27, 2011, at 6:30 PM, Fajar A. Nugraha wrote:

> On Wed, Sep 28, 2011 at 8:21 AM, Edward Ned Harvey
>> So again: Not a problem if you're making your pool out of SSDs.
>
> Big problem if your system is already using most of the available IOPS during normal operation.

Resilvers are throttled, so they should not impact normal operation.

On Sep 27, 2011, at 6:36 PM, Bob Friesenhahn wrote:

> On Tue, 27 Sep 2011, Edward Ned Harvey wrote:
>>
>> The problem basically applies to HDDs. By building your pool out of SSDs,
>> this problem should be eliminated.
>
> This is not completely true. SSDs will help significantly but they will still suffer from the synchronized commit of a transaction group. SSDs don't suffer from seek time, but they do suffer from erase/write time, and many SSDs are capable of only a few thousand flushed writes per second. It is just a matter of degree.

Also, the default settings for the resilver throttle are set for HDDs. For SSDs, it is a good idea to change the throttle to be more aggressive.

> SSDs which do garbage collection during the write cycle could cause the whole vdev to temporarily hang until the last SSD has committed its write.

I think this will be unlikely, especially for a resilver workload. Resilvers are done async, so the only time you will be waiting is for the return of the cache flush during txg commit. In the cases where I've measured cache flushes, they tend to complete faster on SSDs than on HDDs, but it might be worthwhile to characterize this so we know.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> Also, the default settings for the resilver throttle are set for HDDs. For SSDs,
> it is a good idea to change the throttle to be more aggressive.

You mean...
Be more aggressive, resilver faster?
Or be more aggressive, throttling the resilver?

What's the reasoning that makes you want to set it differently from an HDD?
On Sep 28, 2011, at 8:44 PM, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> Also, the default settings for the resilver throttle are set for HDDs. For SSDs,
>> it is a good idea to change the throttle to be more aggressive.
>
> You mean...
> Be more aggressive, resilver faster?
> Or be more aggressive, throttling the resilver?
>
> What's the reasoning that makes you want to set it differently from an HDD?

I think he means resilver faster.

SSDs can be driven harder and have more IOPS, so we can hit them harder with less impact on overall performance. The reason we throttle at all is to avoid saturating the drive with resilver traffic, which would prevent regular operations from making progress. Generally, I believe resilver operations are not "bandwidth bound" in the sense of pure throughput, but are IOPS bound. As SSDs have no seek time, they can handle a lot more of these little operations than a regular hard disk.

- Garrett
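(For anyone who wants to experiment: on Solaris/illumos-derived kernels of this era the resilver throttle is exposed as kernel tunables. The names and default values below are my understanding of the common settings and may differ on your release, so treat them as assumptions and check your own kernel first. A rough sketch of inspecting and loosening the throttle on an all-SSD pool:)

    # Inspect the current resilver throttle settings (decimal output).
    echo "zfs_resilver_delay/D"       | mdb -k   # ticks each resilver I/O is delayed while the pool is busy (commonly 2)
    echo "zfs_resilver_min_time_ms/D" | mdb -k   # minimum ms spent resilvering per txg (commonly 3000)
    echo "zfs_scan_idle/D"            | mdb -k   # idle window, in ticks, after which the delay is skipped (commonly 50)

    # Make resilver more aggressive on the running system (reverts on reboot):
    echo "zfs_resilver_delay/W0t0" | mdb -kw

    # Or persistently, via /etc/system:
    #   set zfs:zfs_resilver_delay = 0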
On Thu, Sep 29, 2011 at 11:33 AM, Garrett D'Amore <garrett.damore at gmail.com> wrote:

> I think he means resilver faster.
>
> SSDs can be driven harder and have more IOPS, so we can hit them harder
> with less impact on overall performance. The reason we throttle at all
> is to avoid saturating the drive with resilver traffic, which would
> prevent regular operations from making progress. Generally, I believe
> resilver operations are not "bandwidth bound" in the sense of pure
> throughput, but are IOPS bound. As SSDs have no seek time, they can handle
> a lot more of these little operations than a regular hard disk.
>
> - Garrett

What's the throttling rate, if I may call it that?

--
Zaeem
2011-09-29 17:15, Zaeem Arshad wrote:

> On Thu, Sep 29, 2011 at 11:33 AM, Garrett D'Amore
> <garrett.damore at gmail.com> wrote:
>
>> I think he means resilver faster.
>>
>> SSDs can be driven harder and have more IOPS, so we can hit them
>> harder with less impact on overall performance. The reason we
>> throttle at all is to avoid saturating the drive with resilver
>> traffic, which would prevent regular operations from making
>> progress. Generally, I believe resilver operations are not
>> "bandwidth bound" in the sense of pure throughput, but are IOPS
>> bound. As SSDs have no seek time, they can handle a lot more of
>> these little operations than a regular hard disk.
>>
>> - Garrett
>
> What's the throttling rate, if I may call it that?

IIRC about 7 MB/s, and I guess it is hardcoded, since the value is so well known as to have been reported several times.

I think another rationale for SSD throttling was with L2ARC tasks - to reduce the probable effects of write overdriving on SSD hardware (less efficiency and more wear on SSD cells).

//Jim
On Oct 16, 2011, at 3:56 AM, Jim Klimov wrote:

> 2011-09-29 17:15, Zaeem Arshad wrote:
>>
>> What's the throttling rate, if I may call it that?
>
> IIRC about 7 MB/s, and I guess it is hardcoded, since the value is so well known as to have been reported several times.

No, the resilver throttling is based more on IOPS than on bandwidth.

> I think another rationale for SSD throttling was with L2ARC tasks - to reduce the probable effects of write overdriving on SSD hardware (less efficiency and more wear on SSD cells).

L2ARC fill rate is, by default in most distros, 16MB/sec until full, then 8MB/sec.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
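(The L2ARC fill rate mentioned above comes, as I understand it, from the l2arc_write_max and l2arc_write_boost kernel tunables, each commonly 8MB/s by default, with the boost applied only while the ARC is still warming up - hence "16MB/sec until full, then 8MB/sec". Exact defaults can vary by distro, so treat the figures as assumptions to verify. A minimal sketch of inspecting and raising them:)

    # Inspect the current L2ARC fill-rate tunables (bytes); /E prints a 64-bit value.
    echo "l2arc_write_max/E"   | mdb -k
    echo "l2arc_write_boost/E" | mdb -k

    # Example: allow up to 64MB per fill interval for a fast SSD cache device
    # (live change, reverts on reboot; 0x4000000 bytes = 64MB).
    echo "l2arc_write_max/Z 0x4000000"   | mdb -kw
    echo "l2arc_write_boost/Z 0x4000000" | mdb -kw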