Folks,

If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do I create multiple raidz3 vdevs? Is there any advantage of having multiple raidz3 vdevs in a single pool?

Thank you in advance for your help.

Regards,
Peter
Hello Peter,

Read the ZFS Best Practices Guide to start. If you still have questions, post back to the list.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pool_Performance_Considerations

-Scott

On Oct 13, 2010, at 3:21 PM, Peter Taps wrote:

> If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do I create multiple raidz3 vdevs? Is there any advantage of having multiple raidz3 vdevs in a single pool?

Scott Meilicke
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Peter Taps
>
> If I have 20 disks to build a raidz3 pool, do I create one big raidz
> vdev or do I create multiple raidz3 vdevs? Is there any advantage of
> having multiple raidz3 vdevs in a single pool?

Whatever you do, *don't* configure one huge raidz3. Consider either three 7-disk raidz1 vdevs, or three 7-disk raidz2 vdevs, or something along those lines. Perhaps three 6-disk raidz1 vdevs plus two hot spares.

raidzN takes a really long time to resilver (code written inefficiently; it's a known problem). If you had a huge raidz3, it would literally never finish, because it couldn't resilver as fast as new data appears. A week later you'd destroy and rebuild your whole pool.

If you can afford mirrors, your risk is much lower. Although it's physically possible for 2 disks to fail simultaneously and ruin the pool, the probability of that happening is smaller than the probability of 3 simultaneous disk failures on the raidz3, due to the smaller resilver window. I highly endorse mirrors for nearly all purposes.
On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote:

> I highly endorse mirrors for nearly all purposes.

Are you a member of BAARF?

http://www.miracleas.com/BAARF/BAARF2.html

:)
> From: David Magda [mailto:dmagda at ee.ryerson.ca]
>
> On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote:
>
> > I highly endorse mirrors for nearly all purposes.
>
> Are you a member of BAARF?
>
> http://www.miracleas.com/BAARF/BAARF2.html

Never heard of it. I don't quite get it ... they want people to stop talking about the pros and cons of various types of RAID? That's definitely not me. I think there are lots of pros and cons, many of them have nuances, and they vary by implementation. It's important to keep talking about it so all of us "experts" in the field can stay current. Take, for example, the number of people on this mailing list who say they still use hardware RAID. That alone demonstrates misinformation (in most cases) and warrants more discussion. ;-)
On Wed, 13 Oct 2010, Edward Ned Harvey wrote:

> raidzN takes a really long time to resilver (code written inefficiently;
> it's a known problem). If you had a huge raidz3, it would literally never
> finish, because it couldn't resilver as fast as new data appears.

In what way is the code written inefficiently?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Sorry, I can't not respond...

Edward Ned Harvey wrote:

> whatever you do, *don't* configure one huge raidz3.

Peter, whatever you do, *don't* make a decision based on blanket generalizations.

> If you can afford mirrors, your risk is much lower. Although it's
> physically possible for 2 disks to fail simultaneously and ruin the pool,
> the probability of that happening is smaller than the probability of 3
> simultaneous disk failures on the raidz3.

Edward, I normally agree with most of what you have to say, but this has gone off the deep end. I can think of counter-use-cases far faster than I can type.

> Due to the smaller resilver window.

Coupled with a smaller MTTDL, a smaller cabinet-space yield, fewer GB per dollar, etc.

> I highly endorse mirrors for nearly all purposes.

Clearly.

Peter, go straight to the source:

http://blogs.sun.com/roch/entry/when_to_and_not_to

In short:

1. vdev_count = spindle_count / (stripe_width + parity_count)
2. IO/s is proportional to vdev_count
3. Usable capacity is proportional to stripe_width * vdev_count
4. A mirror can be approximated by a stripe of width one
5. Mean time to data loss increases exponentially with parity_count
6. Resilver time increases (super)linearly with stripe width

Balance the capacity available, the storage needed, the performance needed, and your own level of paranoia regarding data loss. My home server's main storage is a 22-disk (19 + 3) RAIDZ3 pool backed up hourly to a 14-disk (11 + 3) RAIDZ3 backup pool. Clearly this is not a production Oracle server. Equally clear is that my paranoia index is rather high.

ZFS will let you choose the combination of stripe width and parity count which works for you. There is no "one size fits all."
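Turning those rules of thumb into numbers for the 20-disk case in question is straightforward. A minimal sketch in Python follows; the candidate layouts are illustrative and the figures are rule-of-thumb ratios only, not benchmarks:

    # Rule-of-thumb arithmetic only; the candidate 20-disk layouts are illustrative.
    SPINDLES = 20

    # (label, stripe_width, parity_count); per rule 4, a 2-way mirror is width 1 + parity 1
    candidates = [
        ("1 x raidz3 (17+3)", 17, 3),
        ("2 x raidz2 (8+2)",   8, 2),
        ("4 x raidz1 (4+1)",   4, 1),
        ("10 x 2-way mirror",  1, 1),
    ]

    for label, width, parity in candidates:
        vdevs  = SPINDLES // (width + parity)   # rule 1
        iops   = vdevs                          # rule 2: IO/s scales with vdev count
        usable = width * vdevs                  # rule 3: in units of one disk's capacity
        print(f"{label:20s} vdevs={vdevs:2d}  relative IOPS ~{iops:2d}x  usable ~{usable:2d} disks")

The usual trade-off falls out directly: more, narrower vdevs buy IOPS and shorter resilvers at the cost of usable capacity.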
On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes <martyscholes at yahoo.com> wrote:

> My home server's main storage is a 22-disk (19 + 3) RAIDZ3 pool backed up hourly to a 14-disk (11 + 3) RAIDZ3 backup pool.

How long does it take to resilver a disk in that pool? And how long does it take to run a scrub?

When I initially set up a 24-disk raidz2 vdev, it died trying to resilver a single 500 GB SATA disk: I/O under 1 MB/s, all 24 drives thrashing like crazy, and I could barely even log in to the system and type. It was a nightmare. On top of that, normal (no scrub, no resilver) disk I/O was abysmal.

Since then, I've avoided any vdev with more than 8 drives in it.

--
Freddie Cash
fjwcash at gmail.com
> How long does it take to resilver a disk in that pool? And how long
> does it take to run a scrub?
>
> Since then, I've avoided any vdev with more than 8 drives in it.

My situation is kind of unique. I picked up 120 15K 73 GB FC disks early this year for $2 apiece, so spindle count is a non-issue. As a home server it has very little need for write IOPS, and I have 8 disks for L2ARC on the main pool.

The main pool is at 40% capacity and the backup pool is at 65%. Both take about 70 minutes to scrub. The last time I tested a resilver it took about 3 hours.

The difference is that these are low-capacity 15K FC spindles and the pool has very little sustained I/O; it only bursts now and again. Resilvers run mostly uncontested, and with RAIDZ3 plus autoreplace=off, I can actually schedule a resilver.
On 10/16/10 12:29 PM, Marty Scholes wrote:

> My situation is kind of unique. I picked up 120 15K 73 GB FC disks early this year for $2 apiece, so spindle count is a non-issue.

I'd hate to be paying your power bill!

> The main pool is at 40% capacity and the backup pool is at 65%. Both take about 70 minutes to scrub. The last time I tested a resilver it took about 3 hours.

So a tiny, fast drive takes three hours; consider how long a 30x bigger, much slower drive will take.

--
Ian.
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> In what way is the code written inefficiently?

Here is a link to one message in the middle of a really long thread. The thread touched on a lot of things, so it's difficult to read it now and work out what it all boils down to and which parts are relevant to the present discussion. Relevant comments below...

http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg41998.html

In conclusion of the referenced thread: the raidzN resilver code is inefficient, especially when there are a lot of disks in the vdev, because...

1. It processes one slab at a time. That's very important. Each disk spends a lot of idle time waiting for the next disk to fetch something, so there is an opportunity to start prefetching data on the idle disks, and that is not happening.

2. Each slab is spread across many disks, so the average seek time to fetch the slab approaches the maximum seek time of a single disk. That means an average 2x longer than average seek time.

2a. The more disks in the vdev, the smaller the piece of data that gets written to each individual disk. So you are waiting for the maximum seek time in order to fetch a slab fragment which is tiny.

3. The order of slab fetching is determined by creation time, not by disk layout. This is a huge setback. It means each seek is essentially random, which yields maximum seek time, instead of being sequential, which approaches zero seek time. If you could cut the seek time down to zero, suddenly you wouldn't care about seek time and you'd start paying attention to some other limiting factor.
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg42017.html

4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and they're trying to resilver at the same time. Does the system ignore subsequently failed disks and concentrate on restoring a single disk quickly? Or does the system try to resilver them all simultaneously and therefore double or triple the time before any one disk is fully resilvered?

5. If all your files reside in one big raidz3, that means a little piece of *every* slab in the pool must be on each disk. We've concluded above that you are approaching maximum seek time, and now we're also concluding you must do the maximum number of possible seeks. If instead you break your big raidz3 vdev into 3 raidz1 vdevs, that means each raidz1 vdev will have approx 33% as many slab pieces on it. If you need to resilver a disk, even though you're resilvering approximately the same number of bytes per disk as you would have in raidz3, in the raidz1 you've cut the number of seeks down to 33%, and you've reduced the time necessary for each of those seeks.

Still better: compare a 23-disk raidz3 (capacity of 20 disks) against 20 mirrors. Resilver one disk. You only require 5% as many seeks, and each seek will go twice as fast. So the mirror will resilver 40x faster. Also, if anybody is actually using the pool during that time, only 5% of the user operations will result in a seek on the resilvering mirror disk, while 100% of the user operations will hurt the raidz3 resilver.
6. Please see the following calculation of probability of failure of 20 mirrors vs a 23-disk raidz3. According to my calculations, the probability of 4 disk failures in the raidz3 is approx 4.4E-4 and the probability of 2 disks in the same mirror failing is approx 5E-5. So the chance of either pool failing is very small, but the raidz3 is approx 10x more likely to suffer pool failure than the mirror setup. Granted, there is some linear estimation which is not entirely accurate, but I think the calculation comes within an order of magnitude of being correct. The mirror setup is 65% more hardware, 10x more reliable, and much faster than the raidz3 setup, with the same usable capacity.

http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf

... Compare the 21-disk raidz3 versus 3 vdevs of 7-disk raidz1. You get more than 3x faster resilver time with the smaller vdevs, and you only get 3x the redundancy in the raidz3. That means the probability of 4 simultaneously failed disks in the raidz3 is higher than the probability of 2 failed disks in a single raidz1 vdev.
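One common way to frame that kind of estimate is sketched below. The annual failure rate and resilver windows are placeholder assumptions, not the inputs behind the PDF, so the outputs will not match the 4.4E-4 / 5E-5 figures; the point is only the shape of the comparison (independent failures, pool loss requires additional failures inside a resilver window):

    from math import comb

    # Placeholder assumptions; not the inputs used in the PDF referenced above.
    afr = 0.03                  # assumed annual failure rate per disk
    mirror_resilver_h = 6       # assumed mirror resilver window, hours
    raidz3_resilver_h = 72      # assumed raidz3 resilver window, hours

    def p_fail(window_hours):
        """Probability a given disk fails within the window (rare-event approximation)."""
        return afr * window_hours / (365 * 24)

    # Given that one disk has already failed and is resilvering:
    # 20 x 2-way mirrors: loss if the failed disk's single partner dies within the window.
    p_mirror_loss = p_fail(mirror_resilver_h)

    # 23-disk raidz3: loss if any 3 of the remaining 22 disks die within the (longer) window.
    p = p_fail(raidz3_resilver_h)
    p_raidz3_loss = comb(22, 3) * p**3

    print(f"P(loss | first failure), mirrors: {p_mirror_loss:.2e}")
    print(f"P(loss | first failure), raidz3 : {p_raidz3_loss:.2e}")

Which pool comes out ahead is extremely sensitive to the assumed resilver windows and failure rates, which is exactly where the disagreement in this thread lies.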
I would definitely consider raidz2 or raidz3 in several vdevs, with a maximum of 8-9 drives in each vdev, not one huge 20-disk vdev. One vdev gives you roughly the IOPS of one single drive; with three vdevs you get the IOPS of three drives. That is better than one single vdev of 20 disks.
On Oct 16, 2010, at 4:57 AM, Edward Ned Harvey wrote:

> The raidzN resilver code is inefficient, especially when there are a lot of
> disks in the vdev, because...
>
> 1. It processes one slab at a time. That's very important. Each disk
> spends a lot of idle time waiting for the next disk to fetch something, so
> there is an opportunity to start prefetching data on the idle disks, and
> that is not happening.

Slabs don't matter. So the rest of this argument is moot.

> 2. Each slab is spread across many disks, so the average seek time to fetch
> the slab approaches the maximum seek time of a single disk. That means an
> average 2x longer than average seek time.

nope.

> 2a. The more disks in the vdev, the smaller the piece of data that gets
> written to each individual disk. So you are waiting for the maximum seek
> time in order to fetch a slab fragment which is tiny.

This is an oversimplification. In all of the resilvering tests I've done, the resilver time is entirely based on the random write performance of the resilvering disk.

> 3. The order of slab fetching is determined by creation time, not by disk
> layout. This is a huge setback. It means each seek is essentially random,
> which yields maximum seek time, instead of being sequential, which
> approaches zero seek time.

Seeks are usually quite small compared to the rotational delay, due to the way data is written.

> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> they're trying to resilver at the same time. Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly?

No, of course.

> Or does the system try to resilver them all simultaneously and
> therefore double or triple the time before any one disk is fully resilvered?

Yes, of course.

> 5. If all your files reside in one big raidz3, that means a little piece of
> *every* slab in the pool must be on each disk. We've concluded above that
> you are approaching maximum seek time,

No, you are jumping to the conclusion that data is allocated at the beginning and the end of the device, which is not the case.

> and now we're also concluding you must do the maximum number of possible
> seeks. If instead you break your big raidz3 vdev into 3 raidz1 vdevs, that
> means each raidz1 vdev will have approx 33% as many slab pieces on it.

Again, misuse of the term "slab." A record will exist in only one set.
So it is simply a matter of finding the records that need to be resilvered.

> If you need to resilver a disk, even though you're resilvering approximately
> the same number of bytes per disk as you would have in raidz3, in the raidz1
> you've cut the number of seeks down to 33%, and you've reduced the time
> necessary for each of those seeks.

No, not really. The metadata contains the information you need to locate the records to be resilvered. By design, the metadata is redundant and spread across top-level vdevs or, in the case of a single top-level vdev, made redundant and diverse. So there are two activities in play:
1. metadata is read in time order and prefetched
2. records are reconstructed from the surviving vdevs

> Still better: compare a 23-disk raidz3 (capacity of 20 disks) against 20
> mirrors. Resilver one disk. You only require 5% as many seeks, and each
> seek will go twice as fast.

Again, this is an oversimplification that assumes seeks are not done in parallel. In reality, the I/Os are scheduled to each device in the set concurrently, so the total number of seeks per set is moot.

> So the mirror will resilver 40x faster.

I've never seen data to support this. And yes, I've done many experiments and observed real-life reconstruction.

> Also, if anybody is actually using the pool during that time, only 5% of the
> user operations will result in a seek on the resilvering mirror disk, while
> 100% of the user operations will hurt the raidz3 resilver.

Good argument for SSDs, yes? :-)

> 6. Please see the following calculation of probability of failure of 20
> mirrors vs a 23-disk raidz3. According to my calculations, the probability
> of 4 disk failures in the raidz3 is approx 4.4E-4 and the probability of 2
> disks in the same mirror failing is approx 5E-5.
> http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf

Ok, you've shared the math and it isn't quite right. To build a better model, you will need to work on the probability of each sector being corrupt and where those corrupt sectors lie. What we tend to see in the field is that the probability of failure follows the models from the vendors, and the locality follows more traditional location models. Location models for HDDs are not easy, because there are so many layers of reordering, caching, and optimization. IMHO it is better to rely on empirical studies, which I have done. My data does not match your model very well. Do you have some measurements to back up your hypothesis?

> Compare the 21-disk raidz3 versus 3 vdevs of 7-disk raidz1. You get more
> than 3x faster resilver time with the smaller vdevs, and you only get 3x
> the redundancy in the raidz3. That means the probability of 4
> simultaneously failed disks in the raidz3 is higher than the probability of
> 2 failed disks in a single raidz1 vdev.

Disagree. We do have models for this and can do the math.
Starting with the model I described in "ZFS data protection comparison" and extending to 21 disks, we see:

    Config              MTTDL[1] (years)
    3x 7-disk raidz1               2,581
    21-disk raidz3            37,499,659

As I've said many times, and shown data to prove (next chance is at the OpenStorage Summit in a few weeks :-), the resilver becomes constrained by the performance of the resilvering disk, not the surviving disks.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
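For readers who want to see where numbers of this kind come from, here is a minimal sketch of the commonly cited MTTDL[1] approximation. The disk MTTF and resilver time below are placeholder assumptions, so the output will not reproduce the table above, which depends on the inputs used in the original comparison:

    from functools import reduce

    # Placeholder assumptions; not the inputs behind the table above.
    MTTF_H = 1_000_000   # assumed disk MTTF, hours
    MTTR_H = 168         # assumed resilver (repair) time, hours

    def mttdl1_vdev(n_disks, parity):
        """MTTDL[1] approximation for one raidz vdev:
        MTTF^(P+1) / (N * (N-1) * ... * (N-P) * MTTR^P)."""
        denom = reduce(lambda acc, k: acc * (n_disks - k), range(parity + 1), 1)
        return MTTF_H ** (parity + 1) / (denom * MTTR_H ** parity)

    def mttdl1_pool(vdev_count, n_disks, parity):
        # Independent top-level vdevs: pool MTTDL ~ vdev MTTDL / vdev count.
        return mttdl1_vdev(n_disks, parity) / vdev_count

    HOURS_PER_YEAR = 24 * 365
    for label, vdevs, n, p in [("3x 7-disk raidz1", 3, 7, 1),
                               ("21-disk raidz3",   1, 21, 3)]:
        years = mttdl1_pool(vdevs, n, p) / HOURS_PER_YEAR
        print(f"{label:18s} MTTDL[1] ~ {years:,.0f} years")

Whatever the absolute numbers, the structure shows why MTTDL grows so quickly with parity count: each additional parity device multiplies the result by roughly another factor of MTTF/MTTR.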
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> > http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg41998.html
>
> Slabs don't matter. So the rest of this argument is moot.

Tell it to Erik. He might want to know. Or maybe he knows better than you.

> > 2. Each slab is spread across many disks, so the average seek time to fetch
> > the slab approaches the maximum seek time of a single disk. That means an
> > average 2x longer than average seek time.
>
> nope.

Anything intelligent to add? Or just "nope"?

> Seeks are usually quite small compared to the rotational delay, due to
> the way data is written.

I'm using the term "seek time" to refer to the time from when the drive receives an instruction to when it is actually able to read/write the requested data. In drive spec sheets this is often referred to as "seek time," so I don't think I'm misusing the term, and it includes the rotational delay.

> > 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> > they're trying to resilver at the same time. Does the system ignore
> > subsequently failed disks and concentrate on restoring a single disk
> > quickly?
>
> No, of course.
>
> > Or does the system try to resilver them all simultaneously and
> > therefore double or triple the time before any one disk is fully resilvered?
>
> Yes, of course.

Are those supposed to be real answers? Or are you mocking me? It sounds like mocking.

If you don't mind, please try to stick with productive conversation. I'm just skipping the rest of your reply from here down, because I'm considering it hostile and unnecessary to read or reply further.
On Oct 18, 2010, at 6:52 AM, Edward Ned Harvey wrote:

>> Slabs don't matter. So the rest of this argument is moot.
>
> Tell it to Erik. He might want to know. Or maybe he knows better than you.

You were the one who posted this. If you intend to follow citations, then there are quite a number of useful discussions on resilvering in the 2007-2008 archives.

>> nope.
>
> Anything intelligent to add? Or just "nope"?

The assertion that the average seek is 2x longer than an average seek time is wrong. This is all done in parallel, not serially, so there is no 2x penalty.

> I'm using the term "seek time" to refer to the time from when the drive
> receives an instruction to when it is actually able to read/write the
> requested data. In drive spec sheets this is often referred to as "seek
> time," so I don't think I'm misusing the term, and it includes the
> rotational delay.

It is important because you have concentrated your concern on seek time. Even if the seek time were zero, you can't get past the rotational delay on HDDs. For reads, which are what we are concerned about here, the likelihood of the data existing in the track cache is high, so the penalty of a blown rev is low.

> Are those supposed to be real answers? Or are you mocking me? It sounds
> like mocking.
>
> If you don't mind, please try to stick with productive conversation. I'm
> just skipping the rest of your reply from here down, because I'm considering
> it hostile and unnecessary to read or reply further.

If you want to recommend configurations and compare or contrast their merits, then you should be able to defend your decisions. In engineering this would be known as a critical design review, where the operational definition of "critical" is "expressing or involving an analysis of the merits and faults of a work product", incorporating detailed and scholarly analysis and commentary. While people who are not experienced with critical design reviews may view them as hostile, the desire to achieve a better product or result is the ultimate goal. Check your ego at the door.
 -- richard
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> they're trying to resilver at the same time. Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly? Or does the system try to resilver them all simultaneously and
> therefore double or triple the time before any one disk is fully
> resilvered?

This is a legitimate question. If anyone knows, I'd like to know...
On Wed, Oct 20, 2010 at 4:05 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:

>> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
>> they're trying to resilver at the same time. Does the system ignore
>> subsequently failed disks and concentrate on restoring a single disk
>> quickly? Or does the system try to resilver them all simultaneously and
>> therefore double or triple the time before any one disk is fully
>> resilvered?
>
> This is a legitimate question. If anyone knows, I'd like to know...

My recent experience with os_111b, os_134 and oi_147 was that a subsequent failure and disk replacement causes the resilver to restart from the beginning, including the new disks on the later pass. If the disk is not replaced, the resilver runs to completion (and then a replace can be performed with a new resilver). This is, however, an area that is still being developed, so changes may be coming.

--
- Tuomas
Since my name was mentioned, a couple of things:

(a) I'm not infallible. :-)

(b) In my posts, I swapped "slab" for "record". I really should have said "record"; it's more correct as to what's going on.

(c) It is possible for constituent drives in a raidz to be issued concurrent requests for portions of a record, which *may* increase efficiency. So the "assembly" of a complete record isn't a completely serial operation (that is, ZFS doesn't wait for all the parts of a record to be assembled before issuing further requests for the next record). Drives may therefore have requests for multiple portions of records sitting in their "todo" queues. Thus, all "good" (i.e. being rebuilt *from*) drives should be constantly busy, not waiting around for others to finish reading data. That all said, I don't see where in the code it is indicated how many records can be done in parallel. 2? 4? 20? It matters quite a bit.

(d) Writing completed record parts (i.e. the segments that need to be resilvered) is also queued up, so, for the most part, the replaced drive is doing relatively sequential I/O. That is, *usually* the head doesn't have to seek and *may* not even have to wait much for rotational delay; it just stays where it left off and writes the next reconstructed data. Now, for drives which are not replaced, but rather just "stale", this often isn't true, and those drives may be stuck seeking quite a bit. But since they're usually only slightly stale, it isn't noticed that much.

(e) Given (c) above, the average performance of a drive being read does tend to be "average" for random I/O, that is, roughly half the maximum seek time plus half a revolution of rotational latency. NCQ etc. will help this by clustering reads, so actual performance should be better than a pure average, but I'd not bet on a significant improvement. And for a typical pool, I'm going to make a bald-faced statement that the HD read cache is going to be much less helpful than usual (for a typical filesystem with lots of small files, most will fit in a single record, and the next location on the HD is likely NOT to be something you want); that is, HD read-ahead cache misses are going to be frequent. All this assumes you are reconstructing a drive in a pool which has not been written sequentially; those kinds of zpools will resilver much faster than zpools exposed to "typical" read/write patterns.

(f) IOPS is going to be the limiting factor, particularly for the resilvering drive, as there is less opportunity to group writes than there is to group reads (even allowing for (d) above). My reading of the code says that ZFS issues writes to the resilvering drive as the opportunity arises; that is, ZFS itself doesn't try to batch up multiple records into a single write request. I'd like verification of this, though.

-Erik

--
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
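To get a feel for why point (f) tends to dominate, here is a rough back-of-the-envelope sketch. Every input (amount of allocated data, average record size, sustained write IOPS) is a hypothetical placeholder rather than a measurement of any particular pool:

    # Rough resilver-time estimate when the resilvering disk's per-I/O write
    # rate is the bottleneck.  All inputs are hypothetical placeholders.
    allocated_gb  = 500    # data to reconstruct onto the replacement disk
    avg_record_kb = 64     # average record/fragment written per I/O
    write_iops    = 150    # sustained small-write IOPS of the replacement disk

    writes_needed = allocated_gb * 1024 * 1024 / avg_record_kb
    hours = writes_needed / write_iops / 3600
    print(f"~{writes_needed:,.0f} writes, ~{hours:.1f} hours at {write_iops} IOPS")

Larger average records (as in mostly sequentially written pools) cut the I/O count, and therefore the time, roughly proportionally, which is consistent with points (d) through (f) above: record size and write batching, not raw bandwidth, set the pace.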