Hi. If I am using slightly more reliable SAS drives versus SATA, SSDs for both L2Arc and ZIL and lots of RAM, will a mirrored pool of say 24 disks hold any significant advantages over a RAIDZ pool?
It should be faster. It really depends on what you are using it for, though; I've been using raidz for my system and I'm very happy with it.

On Wed, Sep 16, 2009 at 8:55 AM, <eneal at businessgrade.com> wrote:
> Hi. If I am using slightly more reliable SAS drives versus SATA, SSDs for
> both L2Arc and ZIL and lots of RAM, will a mirrored pool of say 24 disks
> hold any significant advantages over a RAIDZ pool?
> Hi. If I am using slightly more reliable SAS drives versus SATA, SSDs
> for both L2Arc and ZIL and lots of RAM, will a mirrored pool of say 24
> disks hold any significant advantages over a RAIDZ pool?

Generally speaking, striping mirrors will be faster than raidz or raidz2, but it will require a higher number of disks and therefore higher cost to get the same usable space. The main reason to use raidz or raidz2 instead of striping mirrors would be to keep the cost down, or to get higher usable space out of a fixed number of drives.
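To make that concrete, the two layouts for the same six disks would look something like this (device names are invented for illustration):

  # zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0
  # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

The striped mirrors give you three disks of usable space from the six; the raidz2 gives you four disks of usable space from the same hardware.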
On Wed, September 16, 2009 10:31, Edward Ned Harvey wrote:
>> Hi. If I am using slightly more reliable SAS drives versus SATA, SSDs
>> for both L2Arc and ZIL and lots of RAM, will a mirrored pool of say 24
>> disks hold any significant advantages over a RAIDZ pool?
>
> Generally speaking, striping mirrors will be faster than raidz or raidz2,
> but it will require a higher number of disks and therefore higher cost to
> get the same usable space. The main reason to use raidz or raidz2 instead
> of striping mirrors would be to keep the cost down, or to get higher
> usable space out of a fixed number of drives.

And if you want space /and/ speed, then ZFS's hybrid storage pool is something worth looking into.
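A rough sketch of what that looks like on top of either layout (the pool and device names here are made up; c3t0d0 would be a write-optimized SSD, c3t1d0 a read-optimized one):

  # zpool add tank log c3t0d0      (SSD as separate intent log / slog)
  # zpool add tank cache c3t1d0    (SSD as L2ARC)

The slog absorbs synchronous writes and the cache device extends the ARC, independently of whether the main vdevs are mirrors or raidz.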
Quoting David Magda <dmagda at ee.ryerson.ca>:
> And if you want space /and/ speed, then ZFS's hybrid storage pool is
> something worth looking into.

This is precisely my point. If I'm taking the hybrid approach, what advantages do mirrored pools hold over RAIDZ? As I mentioned, a large amount of RAM, and SSDs for both L2ARC and ZIL.
In addition, if you need the flexibility of moving disks around until the device removal CR integrates, then mirrored pools are more flexible. Detaching disks from a mirror isn't ideal, but if you absolutely have to reuse a disk temporarily then go with mirrors. See the output below.

You can replace disks in either configuration if you want to switch smaller disks with larger disks, for example.

Cindy

# zpool status rzpool
  pool: rzpool
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Tue Sep 15 14:41:24 2009
config:

        NAME        STATE     READ WRITE CKSUM
        rzpool      ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c2t6d0  ONLINE       0     0     0
        spares
          c2t7d0    AVAIL

errors: No known data errors

# zpool detach rzpool c2t6d0
cannot detach c2t6d0: only applicable to mirror and replacing vdevs
# zpool destroy rzpool
# zpool create mirpool mirror c2t0d0 c2t2d0 mirror c2t4d0 c2t6d0 spare c2t5d0
# zpool status mirpool
  pool: mirpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mirpool     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t6d0  ONLINE       0     0     0
        spares
          c2t5d0    AVAIL

errors: No known data errors

# zpool detach mirpool c2t6d0
#
I think in theory the ZIL/L2ARC should make things nice and fast if your workload includes sync requests (database, iSCSI, NFS, etc.), regardless of the backend disks. But the only sure way to know is to test with your workload.

-Scott
On Wed, 16 Sep 2009, eneal at businessgrade.com wrote:
> Hi. If I am using slightly more reliable SAS drives versus SATA, SSDs for
> both L2Arc and ZIL and lots of RAM, will a mirrored pool of say 24 disks
> hold any significant advantages over a RAIDZ pool?

A mirrored pool will support more IOPs. This is true even when using SSDs for L2ARC and ZIL.

Using an SSD for the ZIL dramatically reduces synchronous write latency, but the data still needs to be committed to backing store. If the bulk of the synchronous writes are also random writes, then the throughput is still dependent on the IOPs capacity of the backing store.

Similarly, more RAM and/or a large SSD L2ARC improves the probability that a repeated read will be retrieved from the ARC rather than the backing store, but this depends on the size of the working set, and on whether the reads are ever repeated. There are cases (e.g. daily backups) where reads are rarely repeated.

In summary, write IOPs are still write IOPs, and a read cache only works effectively for repeated reads (or reads of recently written data). You still need to look at the nature of your workload in order to decide if RAIDZ is appropriate.
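As a crude back-of-the-envelope (assuming roughly 100 random IOPs per 7200 RPM disk and no cache hits), the 24 disks in question work out to something like:

  12 x 2-way mirrors:  reads ~ 24 x 100 = 2400 IOPs, writes ~ 12 x 100 = 1200 IOPs
  3 x 8-disk raidz2:   reads ~  3 x 100 =  300 IOPs, writes ~  3 x 100 =  300 IOPs

since a mirror can satisfy reads from either side, while a raidz vdev delivers approximately one disk's worth of small random IOPs per vdev. These are estimates, not measurements, but they show the shape of the difference.

Bob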
> Generally speaking, striping mirrors will be faster than raidz or raidz2,
> but it will require a higher number of disks and therefore higher cost.
> The main reason to use raidz or raidz2 instead of striping mirrors would
> be to keep the cost down, or to get higher usable space out of a fixed
> number of drives.

While it has been a while since I have done storage management for critical systems, the advantage I see with RAIDZn is better fault tolerance: any n drives may fail before the set goes critical. With straight mirroring, failure of the wrong two drives will invalidate the whole pool.

The advantage of striped mirrors is that they offer a better chance of higher IOPs (assuming the I/O is distributed correctly). Also, it might be easier to expand a mirror by upgrading only two drives with larger drives. With RAIDZ, the entire stripe of drives would need to be upgraded.
It's possible to do 3-way (or more) mirrors too, so you may achieve better redundancy than raidz2/3.

Yours
Markus Kovero

On Wed, 16 Sep 2009, Marty Scholes wrote:
> While it has been a while since I have done storage management for
> critical systems, the advantage I see with RAIDZn is better fault
> tolerance: any n drives may fail before the set goes critical.
On Wed, September 16, 2009 10:35, Cindy.Swearingen at Sun.COM wrote:
> Detaching disks from a mirror isn't ideal but if you absolutely have
> to reuse a disk temporarily then go with mirrors. See the output below.
> You can replace disks in either configuration if you want to switch
> smaller disks with larger disks, for example.

In a small configuration, like the home NAS I'm running, the upgrade issue was what drove me to mirrors over RAIDZ, despite the cost.

A typical configuration would have 4 or 5 hot-swap bays. I have 8, though interfaces for only 6 of them, and two are used for boot disks in a mirror, so my data pool is in fact 4 drives. It was cheaper to start with a two-disk mirror, knowing that I could add a second two-disk mirror when needed, than it would have been to invest in 4 disks right away. And (after filling all the slots) it's cheaper to upgrade the two disks in a mirror than the ~4 disks in a RAIDZ if I need more space.

Despite my digital photography, and multiple housemates, I haven't filled the current 800 GB usable space (two vdevs, each a two-disk mirror of 400 GB drives). By the time I do, I will certainly be able to afford larger drives!
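For reference, the two upgrade paths look something like this (device names invented; "tank" stands in for my data pool):

  # zpool add tank mirror c2t3d0 c2t4d0    (grow by adding a second mirror vdev)
  # zpool replace tank c2t1d0 c2t5d0       (swap one side of a mirror for a larger disk)

Replace both halves of a mirror with larger disks and, once the second resilver completes, the vdev can grow to the new size (depending on the build, that may take an export/import or the autoexpand pool property).

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/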
At the end of the day, it TOTALLY depends on your needs. raidz may be the best bet for you if you simply do not need the speed of mirrors, and as another user mentioned, it DOES offer better fault tolerance.

Figure out what your needs are for your workload, THEN ask. These types of loaded questions will ALWAYS get the same answers.

On Wed, Sep 16, 2009 at 10:39 AM, <eneal at businessgrade.com> wrote:
> This is precisely my point. If I'm taking the hybrid approach, what
> advantages do mirrored pools hold over RAIDZ? As I mentioned, a large
> amount of RAM, and SSDs for both L2ARC and ZIL.
On Sep 16, 2009, at 9:38 AM, Marty Scholes wrote:
> While it has been a while since I have done storage management for
> critical systems, the advantage I see with RAIDZn is better fault
> tolerance: any n drives may fail before the set goes critical.
>
> With straight mirroring, failure of the wrong two drives will
> invalidate the whole pool.

This line of reasoning doesn't get you very far. It is much better to take a look at the mean time to data loss (MTTDL) for the various configurations. I wrote a series of blogs to show how this is done.
http://blogs.sun.com/relling/tags/mttdl
 -- richard
Mirrors are much quicker to replace if one DOES fail, though... so I would think that bad stuff could happen with EITHER solution. If you buy a bunch of hard drives for a raidz and they are all from the same batch, they might all fail around the same time. What if you have a raidz2 group and 2 drives fail, then you're adding 2 drives back and another fails before it's complete, because it takes SO long to resilver? At least with mirrors they resilver fast.

The bottom line is that bad stuff CAN happen and often does... so don't let raidz or mirrors be the only solution you have. Redundancy is good. More redundancy is better... but backups are the best.
On Sep 16, 2009, at 10:42 AM, Thomas Burgess wrote:
> Mirrors are much quicker to replace if one DOES fail, though...
> At least with mirrors they resilver fast.

In general, resilver is bound by either the media write bandwidth of the resilvering device or the random IOP capacity of the remaining good drives. Although I don't know of any studies comparing mirror vs raidz resilvering, I would not expect much difference between the two, all else held constant.
 -- richard
hrm, i always thought raidz took longer... learn something every day =)
> This line of reasoning doesn't get you very far. It is much better to
> take a look at the mean time to data loss (MTTDL) for the various
> configurations. I wrote a series of blogs to show how this is done.
> http://blogs.sun.com/relling/tags/mttdl

I will play the Devil's advocate here and point out that the chart shows MTTDL for RAIDZ2, both 6 and 8 disk, is much better than mirroring.

The chart does show that three-way mirroring is better still, and I would guess that RAIDZ3 surpasses that.
On Sep 16, 2009, at 12:50 PM, Marty Scholes wrote:
> I will play the Devil's advocate here and point out that the chart
> shows MTTDL for RAIDZ2, both 6 and 8 disk, is much better than
> mirroring.
>
> The chart does show that three-way mirroring is better still, and I
> would guess that RAIDZ3 surpasses that.

Yes. This is a mathematical way of saying "lose any P+1 of N disks." The important part is that the number of parity disks (or mirror sides) is the big knob to use. But every choice is a trade-off.

For a single set, the results should be intuitive. But as you vary the number of sets, it quickly becomes easier to use the models. For example, with a Thumper, you have 48 disks and zillions of possible combinations to choose from.
 -- richard
On Wed, 16 Sep 2009, Thomas Burgess wrote:
> hrm, i always thought raidz took longer... learn something every day =)

And you were probably right, in spite of Richard's lack of knowledge of a study or the feeling in his gut. Just look at the many postings here about resilvering and you will see far more complaints about raidz taking a long time. Resilver of mirrors will surely do better for large pools which continue to be used during the resilvering.

Bob
> Yes. This is a mathematical way of saying
> "lose any P+1 of N disks."

I am hesitant to beat this dead horse, yet it is a nuance that either I have completely misunderstood or many people I've met have completely missed.

Whether a stripe of mirrors or a mirror of stripes, any single failure makes the array critical, i.e. one failure from disaster.

For example, suppose a stripe of four sets of mirrors. That stripe has 8 disks total: four data and four mirrors. If one disk fails, say on mirror set 3, then set 3 is running on a single disk. Should that remaining disk in set 3 fail, the whole stripe is lost. Yes, the stripe is safe as long as the next failure is not from set 3.

Contrast that with RAIDZ3. Suppose seven total disks with the same effective pool size: 4 data and 3 parity. If any single disk is lost, then the array is not critical and can still survive any other loss. In fact, it can survive a total of any three disk failures before it becomes critical.

I just see it too often where someone states that a stripe of four mirror sets can sustain four disk failures. Yes, that's true, as long as the correct four disks fail. If we could control which disks fail, then none of this would even be necessary, so that argument seems rather silly.
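To put a rough number on it (assuming failures are independent and equally likely to hit any disk): after the first failure in the stripe of four mirrors, 7 disks remain and exactly 1 of them, the surviving half of the degraded mirror, is fatal, so there is a 1-in-7 (about 14%) chance that the next failure destroys the pool. In the 7-disk RAIDZ3 that probability stays at zero until three disks are already gone.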
On Sep 16, 2009, at 1:09 PM, Bob Friesenhahn wrote:
> Just look at the many postings here about resilvering and you will
> see far more complaints about raidz taking a long time.

Actually, I had a ton of data on resilvering which shows mirrors and raidz equivalently bottlenecked on the media write bandwidth. However, there are other cases which are IOPS bound (or CR bound :-) which cover some of the postings here. I think Sommerfeld has some other data which could be pertinent.
 -- richard
On Sep 16, 2009, at 1:29 PM, Marty Scholes wrote:
> Whether a stripe of mirrors or a mirror of stripes, any single failure
> makes the array critical, i.e. one failure from disaster.

Yes. I don't think I've blogged the data, but the MTTDL models will show that RAID-1+0 has a higher MTTDL than RAID-0+1.

> Contrast that with RAIDZ3. Suppose seven total disks with the same
> effective pool size: 4 data and 3 parity. In fact, it can survive a
> total of any three disk failures before it becomes critical.

Yes, but can you quantify this? 2x better? 5x better? 1.01x better? The MTTDL models can help you quantify this.

> I just see it too often where someone states that a stripe of four
> mirror sets can sustain four disk failures. Yes, that's true, as long
> as the correct four disks fail.

The MTTDL models account for this.
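For reference, the first-order MTTDL[1] model works out to roughly the following (a sketch from memory, assuming independent failures, N disks per set, an MTTR that includes the resilver time, and ignoring unrecoverable read errors):

  single parity or 2-way mirror:  MTTDL ~ MTBF^2 / (N * (N-1) * MTTR)
  double parity or 3-way mirror:  MTTDL ~ MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
  triple parity:                  MTTDL ~ MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)

and the MTTDL of a pool of S independent sets is the per-set MTTDL divided by S. Plug in N=2 per mirror versus N=7 for Marty's RAIDZ3 example and you can compare the configurations directly.
 -- richard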
On 09/16/09 14:19, Richard Elling wrote:
> Actually, I had a ton of data on resilvering which shows mirrors and
> raidz equivalently bottlenecked on the media write bandwidth. However,
> there are other cases which are IOPS bound (or CR bound :-) which
> cover some of the postings here.

This primarily has to do with the stripe width and block size. The difference between mirroring and RAID-Z is that with RAID-Z each ZFS block is again chunked up into smaller blocks and distributed across the stripe. So if you have a wide stripe (i.e. 32 disks), a 128k block can be chunked up into 4k blocks, while a small recordsize can be chunked even smaller (i.e. 8k to 1k or 512).

ZFS resilvering is metadata based to allow for efficient resilvering of outages, but when a relatively full disk needs to be replaced you end up bottlenecked on the metadata traversal. If your blocks are chunked up small enough, this becomes a random I/O benchmark for the good disks in the RAID stripe. If your pool is backed by 7200 RPM disks, this can end up taking a very long time. The ZFS team is actively working on improvements in this area.
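As a rough worked example of the chunking: on an 8-disk raidz2 (6 data disks), a 128k block splits into roughly 128k / 6, i.e. about 21k, per disk; on the 32-wide stripe above it drops to about 4k per disk, and an 8k recordsize on that stripe chunks down to the 512-byte to 1k range. At that point a resilver of a full disk is essentially millions of tiny random reads spread across every surviving disk in the stripe.

- Eric

-- 
Eric Schrock, Fishworks                http://blogs.sun.com/eschrock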
On Sep 16, 2009, at 4:29 PM, "Marty Scholes" <martyscholes at yahoo.com> wrote:
> I just see it too often where someone states that a stripe of four
> mirror sets can sustain four disk failures. Yes, that's true, as long
> as the correct four disks fail.

There is another type of failure that mirrors help with, and that is controller or path failures. If one side of a mirror set is on one controller or path and the other on another, then a failure of one will not take down the set.

You can't get that with RAIDZn.

-Ross
On Wed, 16 Sep 2009, Ross Walker wrote:
> There is another type of failure that mirrors help with, and that is
> controller or path failures. If one side of a mirror set is on one
> controller or path and the other on another, then a failure of one
> will not take down the set.
>
> You can't get that with RAIDZn.

Sure you can. Just make sure that 'n' is the same as the number of data disks, and make sure that each disk in the vdev is accessed via a unique controller path. Use raidz3 with six disks. You probably need a lot of vdevs to make this even somewhat cost effective. :-)

Regardless, mirrors are known to be more resilient to temporary path failures.

Bob
rswwalker at gmail.com said:
> There is another type of failure that mirrors help with, and that is
> controller or path failures. If one side of a mirror set is on one
> controller or path and the other on another, then a failure of one
> will not take down the set.
>
> You can't get that with RAIDZn.

You can if you have a stripe of RAIDZn's, and enough controllers (or paths) to go around. The raidz2 below should be able to survive the loss of two controllers, shouldn't it?

Regards,

Marion

$ zpool status -v
  pool: zp1
 state: ONLINE
 scrub: scrub completed after 7h9m with 0 errors on Mon Sep 14 13:39:03 2009
config:

        NAME         STATE     READ WRITE CKSUM
        bulk_zp01    ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t1d0   ONLINE       0     0     0
            c1t1d0   ONLINE       0     0     0
            c4t1d0   ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c6t1d0   ONLINE       0     0     0
            c7t1d0   ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t2d0   ONLINE       0     0     0
            c1t2d0   ONLINE       0     0     0
            c4t2d0   ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c6t2d0   ONLINE       0     0     0
            c7t2d0   ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t3d0   ONLINE       0     0     0
            c1t3d0   ONLINE       0     0     0
            c4t3d0   ONLINE       0     0     0
            c5t3d0   ONLINE       0     0     0
            c6t3d0   ONLINE       0     0     0
            c7t3d0   ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c1t4d0   ONLINE       0     0     0
            c4t4d0   ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
            c6t4d0   ONLINE       0     0     0
            c7t4d0   ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0
            c1t5d0   ONLINE       0     0     0
            c4t5d0   ONLINE       0     0     0
            c5t5d0   ONLINE       0     0     0
            c6t5d0   ONLINE       0     0     0
            c7t5d0   ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t6d0   ONLINE       0     0     0
            c1t6d0   ONLINE       0     0     0
            c4t6d0   ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0
            c6t6d0   ONLINE       0     0     0
            c7t6d0   ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t7d0   ONLINE       0     0     0
            c1t7d0   ONLINE       0     0     0
            c4t7d0   ONLINE       0     0     0
            c5t7d0   ONLINE       0     0     0
            c6t7d0   ONLINE       0     0     0
            c7t7d0   ONLINE       0     0     0
        spares
          c0t0d0     AVAIL
          c1t0d0     AVAIL
          c4t0d0     AVAIL
          c7t0d0     AVAIL

errors: No known data errors
$
On Sep 16, 2009, at 6:50 PM, Marion Hakanson <hakansom at ohsu.edu> wrote:
> You can if you have a stripe of RAIDZn's, and enough controllers
> (or paths) to go around. The raidz2 below should be able to survive
> the loss of two controllers, shouldn't it?

It's not the stripes that make a difference, but the number of controllers there.

What's the system config on that puppy?

-Ross
On Sep 16, 2009, at 6:43 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Sure you can. Just make sure that 'n' is the same as the number of
> data disks, and make sure that each disk in the vdev is accessed via
> a unique controller path. Use raidz3 with six disks. You probably
> need a lot of vdevs to make this even somewhat cost effective. :-)

Well yes, if you have an equal number of parity disks to data disks it would survive, but at that point what's the cost-effectiveness-to-resiliency ratio?

> Regardless, mirrors are known to be more resilient to temporary path
> failures.

As another list member pointed out, you could also avoid the issue by having one raidz disk per controller. But if I'm buying that kind of big iron I might just opt for a 3PAR or EMC and save myself the work, and probably some $ too.

-Ross
On Sep 16, 2009, at 7:17 PM, Ross Walker wrote:
>> Regardless, mirrors are known to be more resilient to temporary path
>> failures.
>
> As another list member pointed out, you could also avoid the issue by
> having one raidz disk per controller. But if I'm buying that kind of
> big iron I might just opt for a 3PAR or EMC and save myself the work,
> and probably some $ too.

In general, for SAS or SATA, having separate controllers does little to improve data availability. The reason is that SAS and SATA are point-to-point or point-to-switch-to-point architectures, and you don't have the shared-bus issues that plague parallel SCSI or IDE. The controllers themselves are approximately an order of magnitude more reliable than your CPU and around two orders of magnitude more reliable than your disk. Put your redundancy where your reliability is weak (the disk), if you want to improve availability.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs
 -- richard
On Wed, Sep 16, 2009 at 08:02:35PM +0300, Markus Kovero wrote:
> It's possible to do 3-way (or more) mirrors too, so you may achieve
> better redundancy than raidz2/3.

I understand there's almost no additional performance penalty to raidz3 over raidz2 in terms of CPU load. Is that correct?

So SSDs for ZIL/L2ARC don't bring that much when used with raidz2/raidz3, if I write a lot, at least, and don't access the cache very much, according to some recent posts on this list.

How much drive space am I losing with mirrored pools versus raidz3? IIRC in RAID 10 it's only 10% over RAID 6, which is why I went for RAID 10 in my 14-drive SATA (WD RE4) setup.

Let's assume I want to fill a 24-drive Supermicro chassis with 1 TByte WD Caviar Black or 2 TByte RE4 drives, and use 4x X25-M 80 GByte 2nd-gen Intel consumer drives, mirrored, each pair as ZIL/L2ARC for the 24 SATA drives behind them. Let's assume CPU is not an issue, with dual-socket Nehalems and 24 GByte RAM or more. There are applications packaged in Solaris containers running on the same box, however.

Let's say the workload is mostly multiple streams (hundreds to thousands simultaneously, some continuous, some bursty), each writing data to the storage system. However, a few clients will be using database-like queries to read, potentially on the entire data store.

With the above workload, is raidz2/raidz3 right out, and will I need mirrored pools?

How would you lay out the pools for the above workload, assuming 24 SATA drives/chassis (24-48 TBytes raw storage), and an 80 GByte SSD each for ZIL/L2ARC (is that too little? Would 160 GByte work better?)

Thanks lots.

-- 
Eugen* Leitl, http://leitl.org
On Wed, Sep 16, 2009 at 10:23:01AM -0700, Richard Elling wrote:
> This line of reasoning doesn't get you very far. It is much better to
> take a look at the mean time to data loss (MTTDL) for the various
> configurations. I wrote a series of blogs to show how this is done.
> http://blogs.sun.com/relling/tags/mttdl

Excellent information, thanks! I presume the MTTDL[1] and MTTDL[2] figures (in years) are the same as in
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

Do you think it would be possible to publish the same information for 24 drives (not all of us can buy a Thumper), and maybe include raidz3 in the number crunch?

Thanks!

-- 
Eugen* Leitl, http://leitl.org
On 17 September, 2009 - Eugen Leitl sent me these 2,0K bytes:
> How much drive space am I losing with mirrored pools versus raidz3?
> IIRC in RAID 10 it's only 10% over RAID 6, which is why I went for
> RAID 10 in my 14-drive SATA (WD RE4) setup.

It's not a fixed value per technology; it depends on the number of disks per group. RAID5/RAIDZ1 "loses" 1 disk worth to parity per group. RAID6/RAIDZ2 loses 2 disks. RAIDZ3 loses 3 disks. RAID1/mirror loses half the disks.

So in your 14-drive case, if you go for one big raid6/raidz2 setup (which is larger than recommended for performance reasons), you will lose 2 disks worth of storage to parity, leaving 12 disks worth of data. With raid10 you will lose half, 7 disks, to parity/redundancy. With two raidz2 sets, you will get (5+2)+(5+2), that is 5+5 disks worth of storage and 2+2 disks worth of redundancy.

The actual redundancy/parity is spread over all disks, not like RAID-3 which has a dedicated parity disk.

For more info, see for example http://en.wikipedia.org/wiki/RAID

/Tomas
-- 
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Eugen Leitl wrote:
> I understand there's almost no additional performance penalty to raidz3
> over raidz2 in terms of CPU load. Is that correct?

As far as I understand the z3 algorithms, the performance penalty is very slightly higher than z2. I think it's reasonable to treat z1, z2, and z3 as equal in terms of CPU load.

> So SSDs for ZIL/L2ARC don't bring that much when used with raidz2/raidz3,
> if I write a lot, at least, and don't access the cache very much, according
> to some recent posts on this list.

Not true. Remember:

  ZIL = write cache
  L2ARC = read cache

So, if you have a write-heavy workload which seldom does much more than large reads, an L2ARC SSD doesn't make much sense. Main RAM should suffice for storing the read cache. Random reads aren't fast on RAIDZ, so a read cache is a good thing if you are doing that kind of I/O. Similarly, random writes (particularly small random writes) suck hard on RAIDZ, so a write cache is a fabulous idea there.

If you are doing very large sequential writes to a RAIDZ (any sort) pool, then a write cache will likely be much less helpful. But remember, "very large" means that you frequently exceed the size of the SSD you've allocated for the ZIL. I'd have to run the numbers, but you should still see a major performance improvement by using an SSD for ZIL, up to the point where your typical write load exceeds 10% of the size of the SSD. Naturally, write-heavy workloads will be murder on an MLC or hybrid SSD's life expectancy, though a large sequential-write-heavy load will allow the SSD to perform better and longer than a small random write load.

A write SSD will help you up until you try to write to the SSD faster than it can flush out its contents to actual disk. So, you need to take into consideration exactly how much data is coming in, and the write speed of your (non-SSD) disks. If you are continuously (and constantly) exceeding the speed of your disks with incoming data, then SSDs won't really help. You'll see some help up until the SSD fills up, then performance will drop to what it would be as if the SSD didn't exist.

Doing [very] rough calculations, let's say your SSD has a read/write throughput of 200MB/s, and is 100GB in size. If your hard drives can only do 50MB/s, then you can write up to 150MB/s to the SSD, read 50MB/s from the SSD, and write 50MB/s to the disks. This means that each second, you fill the SSD with 100MB more data than can be flushed out. At 100MB/s, it takes 1,000 seconds to fill 100GB. So, in about 17 minutes, you've completely filled the SSD, and performance drops like a rock. There is a similar cliff problem around IOPS.

> How much drive space am I losing with mirrored pools versus raidz3?
> IIRC in RAID 10 it's only 10% over RAID 6, which is why I went for
> RAID 10 in my 14-drive SATA (WD RE4) setup.

Basic math says for N disks, you get N-3 amount of space for a RAIDZ3, and N/2 for a 2-way mirror. N-3 > N/2 whenever N > 6 (they are equal at N = 6).
But remember, you'll generally need at least one hot spare for a mirror, so really the equation looks like this: N-3 > (N/2) - 1, which means RAIDZ3 gives you more space for N > 4.

> Let's assume I want to fill a 24-drive Supermicro chassis with 1 TByte
> WD Caviar Black or 2 TByte RE4 drives, and use 4x X25-M 80 GByte
> 2nd-gen Intel consumer drives, mirrored, each pair as ZIL/L2ARC
> for the 24 SATA drives behind them.

Remember to take a look at Richard's spreadsheet about drive errors and the amount of time you can expect to go without serious issue. He's also got good stuff about optimizing for speed vs space.
http://blogs.sun.com/relling/

Quick math for a 24-drive setup:

Scenario A: stripe of mirrors, plus global spares.
  11 x 2-way mirror = 11 disks of data, plus 2 additional hot spares

Scenario B: stripe of raidz3, no global spares.
  3 x 8-drive RAIDZ3 (5 data + 3 parity drives) = 3 x 5 = 15 disks of data, with a total of 9 parity drives

Thus, A gives you about 30% less disk space than B.

> Let's say the workload is mostly multiple streams (hundreds to thousands
> simultaneously, some continuous, some bursty), each writing data
> to the storage system. However, a few clients will be using database-like
> queries to read, potentially on the entire data store.
>
> With the above workload, is raidz2/raidz3 right out, and will I need
> mirrored pools?

The database queries will definitely benefit from an L2ARC SSD - the size of that SSD depends on exactly how much data the query has to check. If it's just checking metadata (mod times, file sizes, permissions, etc.) of lots of files, then you're probably good with a smaller SSD. If you have to actually read large amounts of the streams, then you're pretty well hosed, as your data set is far larger than any cache can hold.

> How would you lay out the pools for the above workload, assuming 24 SATA
> drives/chassis (24-48 TBytes raw storage), and an 80 GByte SSD each for
> ZIL/L2ARC (is that too little? Would 160 GByte work better?)

I can't make recommendations about SSD size without much more specific numbers about the actual workload. Look at Richard's RAIDoptimizer output for a 48-disk Thumper/Thor:
http://blogs.sun.com/relling/entry/sample_raidoptimizer_output
It should give you a good idea about IOPS and read/write speeds for various configs.

Reading/writing a large stream is a sequential operation. Bursty read/write of a stream looks like random I/O. But, more importantly, the relative size of the stream matters. Whether continuous or bursty, the important characteristic is HOW MUCH data needs to be written/read at once. Anything under 100k is definitely "random", and anything over 10MB is "sequential" (as far as general performance goes). Sizes in between depend on how much other stuff is going on (i.e. having 100,000 streams each trying to write 1MB has a different impact than 1,000 streams trying to write the same 1MB each).

Personally, I hate to use any form of SATA drive for a heavy random write or read workload, even with an SSD. SAS disks perform so much better.
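Spelled out as pool creation commands, the two scenarios would look roughly like this (controller/target names are invented, and a raidz3 vdev requires a build recent enough to have triple-parity support):

Scenario A:
  # zpool create tank \
      mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 mirror c1t2d0 c2t2d0 \
      mirror c1t3d0 c2t3d0 mirror c1t4d0 c2t4d0 mirror c1t5d0 c2t5d0 \
      mirror c1t6d0 c2t6d0 mirror c1t7d0 c2t7d0 mirror c1t8d0 c2t8d0 \
      mirror c1t9d0 c2t9d0 mirror c1t10d0 c2t10d0 \
      spare c1t11d0 c2t11d0

Scenario B:
  # zpool create tank \
      raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
      raidz3 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 \
      raidz3 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0

Note that putting each mirror half (or each raidz member) on a different controller, as sketched above, also buys you some of the path redundancy discussed earlier in the thread.

-- 
Erik Trimble
Java System Support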
On Thu, Sep 17, 2009 at 12:55:35PM +0200, Tomas Ögren wrote:
> It's not a fixed value per technology; it depends on the number of disks
> per group. RAID5/RAIDZ1 "loses" 1 disk worth to parity per group.
> RAID6/RAIDZ2 loses 2 disks. RAIDZ3 loses 3 disks. RAID1/mirror loses
> half the disks.

I presume for 24 disks (my next project; the current 16-disk one had to be converted to CentOS for software compatibility reasons) you would recommend splitting them into two groups of 12 disks. With raidz3, there would be 9 disks left for data per group, 18 total -- 36 TBytes effective in the case of 2 TByte WD RE4 drives, half that for WD Caviar Black.

How many hot spares should I leave in each pool, one or more? Is it safe to stripe over two such 12-disk groups? Or is mirror the right thing to do, regardless of drive costs?

Speaking of which, does anyone use NFSv4 clustering in production to aggregate individual ZFS boxes? Experiences good/bad?

> The actual redundancy/parity is spread over all disks, not like RAID-3
> which has a dedicated parity disk.

So raidz3 has a dedicated parity disk? I couldn't see that from skimming
http://blogs.sun.com/ahl/entry/triple_parity_raid_z

> For more info, see for example http://en.wikipedia.org/wiki/RAID

Unfortunately, this is very thin on zfs.
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
is very helpful, but it doesn't offer concrete layout examples for odd numbers of disks (understandable, since Sun has to sell the Thumper), and is pretty mum on raidz3.

Thank you. This list is fun, and helpful.

-- 
Eugen* Leitl, http://leitl.org
Erik Trimble wrote:
> Remember: ZIL = write cache

ZIL is NOT a write cache. The ZIL is the Intent Log, not a cache. It is used only for synchronous writes. It is not a cache, because the term "cache" implies the data is also somewhere else and you lose nothing but potential performance if you lose the cache.

ZFS calls the devices used to hold the ZIL (there is one ZIL per dataset) a SLOG (Separate Log device).

Note also the recent addition of the "logbias" dataset property.
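For illustration, logbias is set per dataset, on a build recent enough to have the property (the dataset names here are made up):

  # zfs set logbias=throughput tank/backup   (send ZIL blocks to the main pool instead of the slog)
  # zfs set logbias=latency tank/db          (the default: use the slog for low-latency sync writes)

-- 
Darren J Moffat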
Darren J Moffat wrote:
> ZIL is NOT a write cache. The ZIL is the Intent Log, not a cache. It
> is used only for synchronous writes.

I should have more properly used the term "buffer", which is what the ZIL is more closely related to. Sorry about that - I didn't mean to imply that the ZIL was the same as something like an STK6140's NVRAM.

-- 
Erik Trimble
Java System Support
rswwalker at gmail.com said:
> It's not the stripes that make a difference, but the number of controllers
> there.
>
> What's the system config on that puppy?

The "zpool status -v" output was from a Thumper (X4500), slightly edited: in our real-world Thumper we use c6t0d0 in c5t4d0's place in the "optimal" layout I posted, because c5t4d0 is used in the boot-drive mirror.

See the following for our 2006 Thumper benchmarks, which appear to bear out Richard Elling's RAIDoptimizer analysis:
http://acc.ohsu.edu/~hakansom/thumper_bench.html

While I'm at it, here are filebench numbers from a recent J4400-based database server deployment, with some "slog vs no-slog" comparisons (sorry, no SSDs available here yet):
http://acc.ohsu.edu/~hakansom/j4400_bench.html

Regards,

Marion
On Thu, Sep 17, 2009 at 01:32:43PM +0200, Eugen Leitl wrote:
> So raidz3 has a dedicated parity disk? I couldn't see that from
> skimming http://blogs.sun.com/ahl/entry/triple_parity_raid_z

Note that Tomas was talking about RAID-3, not raidz3. To summarize the RAID levels:

  RAID-0  striping
  RAID-1  mirror
  RAID-2  ECC (basically not used)
  RAID-3  bit-interleaved parity (basically not used)
  RAID-4  block-interleaved parity
  RAID-5  block-interleaved distributed parity
  RAID-6  block-interleaved double distributed parity

raidz1 is most like RAID-5; raidz2 is most like RAID-6. There's no RAID level that covers more than two parity disks, but raidz3 is most like RAID-6, with triple distributed parity.

Adam

-- 
Adam Leventhal, Fishworks                http://blogs.sun.com/ahl
On Thu, Sep 17, 2009 at 11:41 AM, Adam Leventhal <ahl at eng.sun.com> wrote:
> RAID-3  bit-interleaved parity (basically not used)

There was a hardware RAID chipset that used RAID-3: the Netcell Revolution, I think it was called. It looked interesting and I thought about grabbing one at the time but never got around to it. Netcell is defunct or got bought out, so the controller is no longer available.

-B

-- 
Brandon High : bhigh at freaks.com
On Wed, 2009-09-16 at 14:19 -0700, Richard Elling wrote:
> Actually, I had a ton of data on resilvering which shows mirrors and
> raidz equivalently bottlenecked on the media write bandwidth. However,
> there are other cases which are IOPS bound (or CR bound :-) which
> cover some of the postings here. I think Sommerfeld has some other
> data which could be pertinent.

I'm not sure I have data, but I have anecdotes and observations, and a few large production pools used for Solaris development by me and my coworkers. The biggest one (by disk count) takes 80-100 hours to scrub and/or resilver.

My working hypothesis is that pools which:

 1) have a lot of files, directories, filesystems, and periodic snapshots
 2) have atime updates enabled (the default config)
 3) have regular (daily) jobs doing large-scale filesystem tree-walks

wind up rewriting most blocks of the dnode files on every tree walk doing atime updates, and as a result the dnode file (but not most of the blocks it points to) differs greatly from daily snapshot to daily snapshot. As a result, scrub/resilver traversals end up spending most of their time doing random reads of the dnode files of each snapshot.

Here are some bugs that, if fixed, might help:

  6678033 resilver code should prefetch
  6730737 investigate colocating directory dnodes
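(One obvious mitigation for item 2, if the workload tolerates it, is simply:

  # zfs set atime=off tank/ws

with "tank/ws" standing in for whichever filesystems get tree-walked - though that changes access-time semantics for anything that depends on them, and it doesn't help pools that already carry months of atime-churned snapshots.)

 - Bill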