Folks,

If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do I create multiple raidz3 vdevs? Is there any advantage of having multiple raidz3 vdevs in a single pool?

Thank you in advance for your help.

Regards,
Peter
Hello Peter,

Read the ZFS Best Practices Guide to start. If you still have questions, post back to the list.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pool_Performance_Considerations

-Scott

On Oct 13, 2010, at 3:21 PM, Peter Taps wrote:

> If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do I create multiple raidz3 vdevs? Is there any advantage of having multiple raidz3 vdevs in a single pool?

Scott Meilicke
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Peter Taps
>
> If I have 20 disks to build a raidz3 pool, do I create one big raidz
> vdev or do I create multiple raidz3 vdevs? Is there any advantage of
> having multiple raidz3 vdevs in a single pool?

Whatever you do, *don't* configure one huge raidz3. Consider either three 7-disk raidz1 vdevs, or three 7-disk raidz2 vdevs, or something along those lines. Perhaps three 6-disk raidz1 vdevs plus two hot spares.

raidzN takes a really long time to resilver (code written inefficiently; it's a known problem). If you had a huge raidz3, it would literally never finish, because it couldn't resilver as fast as new data appears. A week later you'd destroy and rebuild your whole pool.

If you can afford mirrors, your risk is much lower. Although it's physically possible for 2 disks to fail simultaneously and ruin the pool, the probability of that happening is smaller than the probability of 3 simultaneous disk failures on the raidz3, due to the smaller resilver window. I highly endorse mirrors for nearly all purposes.
On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote:

> I highly endorse mirrors for nearly all purposes.

Are you a member of BAARF?

http://www.miracleas.com/BAARF/BAARF2.html

:)
> From: David Magda [mailto:dmagda at ee.ryerson.ca]
>
> On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote:
>
> > I highly endorse mirrors for nearly all purposes.
>
> Are you a member of BAARF?
>
> http://www.miracleas.com/BAARF/BAARF2.html

Never heard of it. I don't quite get it ... they want people to stop talking about the pros and cons of various types of RAID? That's definitely not me. I think there are lots of pros and cons, many of them have nuances, and they vary by implementation. It's important to keep talking about it so all of us "experts" in the field can stay current. Take, for example, the number of people on this mailing list who say they still use hardware RAID. That alone demonstrates misinformation (in most cases) and warrants more discussion. ;-)
On Wed, 13 Oct 2010, Edward Ned Harvey wrote:

> raidzN takes a really long time to resilver (code written inefficiently;
> it's a known problem). If you had a huge raidz3, it would literally never
> finish, because it couldn't resilver as fast as new data appears.

In what way is the code written inefficiently?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Sorry, I can't not respond...

Edward Ned Harvey wrote:

> whatever you do, *don't* configure one huge raidz3.

Peter, whatever you do, *don't* make a decision based on blanket generalizations.

> If you can afford mirrors, your risk is much lower. Although it's
> physically possible for 2 disks to fail simultaneously and ruin the pool,
> the probability of that happening is smaller than the probability of 3
> simultaneous disk failures on the raidz3.

Edward, I normally agree with most of what you have to say, but this has gone off the deep end. I can think of counter-use-cases far faster than I can type.

> Due to the smaller resilver window.

Coupled with a smaller MTTDL, a smaller cabinet-space yield, fewer GB per dollar, etc.

> I highly endorse mirrors for nearly all purposes.

Clearly.

Peter, go straight to the source:

http://blogs.sun.com/roch/entry/when_to_and_not_to

In short:

1. vdev_count = spindle_count / (stripe_width + parity_count)
2. IO/s is proportional to vdev_count
3. Usable capacity is proportional to stripe_width * vdev_count
4. A mirror can be approximated by a stripe of width one
5. Mean time to data loss increases exponentially with parity_count
6. Resilver time increases (super)linearly with stripe width

Balance the capacity available, the storage needed, the performance needed, and your own level of paranoia regarding data loss. My home server's main storage is a 22-disk (19 + 3) RAIDZ3 pool backed up hourly to a 14-disk (11 + 3) RAIDZ3 backup pool. Clearly this is not a production Oracle server. Equally clear is that my paranoia index is rather high.

ZFS will let you choose the combination of stripe width and parity count which works for you. There is no "one size fits all."
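Turning those rules of thumb into numbers for the 20-disk case in question is straightforward. A minimal sketch in Python follows; the candidate layouts are illustrative and the figures are rule-of-thumb ratios only, not benchmarks:

    # Rule-of-thumb arithmetic only; the candidate 20-disk layouts are illustrative.
    SPINDLES = 20

    # (label, stripe_width, parity_count); per rule 4, a 2-way mirror is width 1 + parity 1
    candidates = [
        ("1 x raidz3 (17+3)", 17, 3),
        ("2 x raidz2 (8+2)",   8, 2),
        ("4 x raidz1 (4+1)",   4, 1),
        ("10 x 2-way mirror",  1, 1),
    ]

    for label, width, parity in candidates:
        vdevs  = SPINDLES // (width + parity)   # rule 1
        iops   = vdevs                          # rule 2: IO/s scales with vdev count
        usable = width * vdevs                  # rule 3: in units of one disk's capacity
        print(f"{label:20s} vdevs={vdevs:2d}  relative IOPS ~{iops:2d}x  usable ~{usable:2d} disks")

The usual trade-off falls out directly: more, narrower vdevs buy IOPS and shorter resilvers at the cost of usable capacity.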
On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes <martyscholes at yahoo.com> wrote:

> My home server's main storage is a 22-disk (19 + 3) RAIDZ3 pool backed up hourly to a 14-disk (11 + 3) RAIDZ3 backup pool.

How long does it take to resilver a disk in that pool? And how long does it take to run a scrub?

When I initially set up a 24-disk raidz2 vdev, it died trying to resilver a single 500 GB SATA disk: I/O under 1 MB/s, all 24 drives thrashing like crazy, and I could barely even log in to the system and type. It was a nightmare. On top of that, normal (no scrub, no resilver) disk I/O was abysmal.

Since then, I've avoided any vdev with more than 8 drives in it.

--
Freddie Cash
fjwcash at gmail.com
> How long does it take to resilver a disk in that pool? And how long
> does it take to run a scrub?
>
> Since then, I've avoided any vdev with more than 8 drives in it.

My situation is kind of unique. I picked up 120 15K 73 GB FC disks early this year for $2 apiece, so spindle count is a non-issue. As a home server it has very little need for write IOPS, and I have 8 disks for L2ARC on the main pool.

The main pool is at 40% capacity and the backup pool is at 65%. Both take about 70 minutes to scrub. The last time I tested a resilver it took about 3 hours.

The difference is that these are low-capacity 15K FC spindles and the pool has very little sustained I/O; it only bursts now and again. Resilvers run mostly uncontested, and with RAIDZ3 plus autoreplace=off, I can actually schedule a resilver.
On 10/16/10 12:29 PM, Marty Scholes wrote:

> My situation is kind of unique. I picked up 120 15K 73 GB FC disks early this year for $2 apiece, so spindle count is a non-issue.

I'd hate to be paying your power bill!

> The main pool is at 40% capacity and the backup pool is at 65%. Both take about 70 minutes to scrub. The last time I tested a resilver it took about 3 hours.

So a tiny, fast drive takes three hours; consider how long a 30x bigger, much slower drive will take.

--
Ian.
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> In what way is the code written inefficiently?

Here is a link to one message in the middle of a really long thread. The thread touched on a lot of things, so it's difficult to read it now and work out what it all boils down to and which parts are relevant to the present discussion. Relevant comments below...

http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg41998.html

In conclusion of the referenced thread: the raidzN resilver code is inefficient, especially when there are a lot of disks in the vdev, because...

1. It processes one slab at a time. That's very important. Each disk spends a lot of idle time waiting for the next disk to fetch something, so there is an opportunity to start prefetching data on the idle disks, and that is not happening.

2. Each slab is spread across many disks, so the average seek time to fetch the slab approaches the maximum seek time of a single disk. That means an average 2x longer than average seek time.

2a. The more disks in the vdev, the smaller the piece of data that gets written to each individual disk. So you are waiting for the maximum seek time in order to fetch a slab fragment which is tiny.

3. The order of slab fetching is determined by creation time, not by disk layout. This is a huge setback. It means each seek is essentially random, which yields maximum seek time, instead of being sequential, which approaches zero seek time. If you could cut the seek time down to zero, suddenly you wouldn't care about seek time and you'd start paying attention to some other limiting factor.
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg42017.html

4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and they're trying to resilver at the same time. Does the system ignore subsequently failed disks and concentrate on restoring a single disk quickly? Or does the system try to resilver them all simultaneously and therefore double or triple the time before any one disk is fully resilvered?

5. If all your files reside in one big raidz3, that means a little piece of *every* slab in the pool must be on each disk. We've concluded above that you are approaching maximum seek time, and now we're also concluding you must do the maximum number of possible seeks. If instead you break your big raidz3 vdev into 3 raidz1 vdevs, that means each raidz1 vdev will have approx 33% as many slab pieces on it. If you need to resilver a disk, even though you're resilvering approximately the same number of bytes per disk as you would have in raidz3, in the raidz1 you've cut the number of seeks down to 33%, and you've reduced the time necessary for each of those seeks.

Still better: compare a 23-disk raidz3 (capacity of 20 disks) against 20 mirrors. Resilver one disk. You only require 5% as many seeks, and each seek will go twice as fast. So the mirror will resilver 40x faster. Also, if anybody is actually using the pool during that time, only 5% of the user operations will result in a seek on the resilvering mirror disk, while 100% of the user operations will hurt the raidz3 resilver.
6. Please see the following calculation of probability of failure of 20 mirrors vs a 23-disk raidz3. According to my calculations, the probability of 4 disk failures in the raidz3 is approx 4.4E-4 and the probability of 2 disks in the same mirror failing is approx 5E-5. So the chance of either pool failing is very small, but the raidz3 is approx 10x more likely to suffer pool failure than the mirror setup. Granted, there is some linear estimation which is not entirely accurate, but I think the calculation comes within an order of magnitude of being correct. The mirror setup is 65% more hardware, 10x more reliable, and much faster than the raidz3 setup, with the same usable capacity.

http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf

... Compare the 21-disk raidz3 versus 3 vdevs of 7-disk raidz1. You get more than 3x faster resilver time with the smaller vdevs, and you only get 3x the redundancy in the raidz3. That means the probability of 4 simultaneously failed disks in the raidz3 is higher than the probability of 2 failed disks in a single raidz1 vdev.
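One common way to frame that kind of estimate is sketched below. The annual failure rate and resilver windows are placeholder assumptions, not the inputs behind the PDF, so the outputs will not match the 4.4E-4 / 5E-5 figures; the point is only the shape of the comparison (independent failures, pool loss requires additional failures inside a resilver window):

    from math import comb

    # Placeholder assumptions; not the inputs used in the PDF referenced above.
    afr = 0.03                  # assumed annual failure rate per disk
    mirror_resilver_h = 6       # assumed mirror resilver window, hours
    raidz3_resilver_h = 72      # assumed raidz3 resilver window, hours

    def p_fail(window_hours):
        """Probability a given disk fails within the window (rare-event approximation)."""
        return afr * window_hours / (365 * 24)

    # Given that one disk has already failed and is resilvering:
    # 20 x 2-way mirrors: loss if the failed disk's single partner dies within the window.
    p_mirror_loss = p_fail(mirror_resilver_h)

    # 23-disk raidz3: loss if any 3 of the remaining 22 disks die within the (longer) window.
    p = p_fail(raidz3_resilver_h)
    p_raidz3_loss = comb(22, 3) * p**3

    print(f"P(loss | first failure), mirrors: {p_mirror_loss:.2e}")
    print(f"P(loss | first failure), raidz3 : {p_raidz3_loss:.2e}")

Which pool comes out ahead is extremely sensitive to the assumed resilver windows and failure rates, which is exactly where the disagreement in this thread lies.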
I would definitely consider raidz2 or raidz3 in several vdevs, with a maximum of 8-9 drives in each vdev, not one huge 20-disk vdev. One vdev gives you roughly the IOPS of one single drive; with three vdevs you get the IOPS of three drives. That is better than one single vdev of 20 disks.
On Oct 16, 2010, at 4:57 AM, Edward Ned Harvey wrote:

> The raidzN resilver code is inefficient, especially when there are a lot of
> disks in the vdev, because...
>
> 1. It processes one slab at a time. That's very important. Each disk
> spends a lot of idle time waiting for the next disk to fetch something, so
> there is an opportunity to start prefetching data on the idle disks, and
> that is not happening.

Slabs don't matter. So the rest of this argument is moot.

> 2. Each slab is spread across many disks, so the average seek time to fetch
> the slab approaches the maximum seek time of a single disk. That means an
> average 2x longer than average seek time.

nope.

> 2a. The more disks in the vdev, the smaller the piece of data that gets
> written to each individual disk. So you are waiting for the maximum seek
> time in order to fetch a slab fragment which is tiny.

This is an oversimplification. In all of the resilvering tests I've done, the resilver time is entirely based on the random write performance of the resilvering disk.

> 3. The order of slab fetching is determined by creation time, not by disk
> layout. This is a huge setback. It means each seek is essentially random,
> which yields maximum seek time, instead of being sequential, which
> approaches zero seek time.

Seeks are usually quite small compared to the rotational delay, due to the way data is written.

> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> they're trying to resilver at the same time. Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly?

No, of course.

> Or does the system try to resilver them all simultaneously and
> therefore double or triple the time before any one disk is fully resilvered?

Yes, of course.

> 5. If all your files reside in one big raidz3, that means a little piece of
> *every* slab in the pool must be on each disk. We've concluded above that
> you are approaching maximum seek time,

No, you are jumping to the conclusion that data is allocated at the beginning and the end of the device, which is not the case.

> and now we're also concluding you must do the maximum number of possible
> seeks. If instead you break your big raidz3 vdev into 3 raidz1 vdevs, that
> means each raidz1 vdev will have approx 33% as many slab pieces on it.

Again, misuse of the term "slab." A record will exist in only one set.
So it is simply a matter of finding the records that need to be resilvered.

> If you need to resilver a disk, even though you're resilvering approximately
> the same number of bytes per disk as you would have in raidz3, in the raidz1
> you've cut the number of seeks down to 33%, and you've reduced the time
> necessary for each of those seeks.

No, not really. The metadata contains the information you need to locate the records to be resilvered. By design, the metadata is redundant and spread across top-level vdevs or, in the case of a single top-level vdev, made redundant and diverse. So there are two activities in play:
1. metadata is read in time order and prefetched
2. records are reconstructed from the surviving vdevs

> Still better: compare a 23-disk raidz3 (capacity of 20 disks) against 20
> mirrors. Resilver one disk. You only require 5% as many seeks, and each
> seek will go twice as fast.

Again, this is an oversimplification that assumes seeks are not done in parallel. In reality, the I/Os are scheduled to each device in the set concurrently, so the total number of seeks per set is moot.

> So the mirror will resilver 40x faster.

I've never seen data to support this. And yes, I've done many experiments and observed real-life reconstruction.

> Also, if anybody is actually using the pool during that time, only 5% of the
> user operations will result in a seek on the resilvering mirror disk, while
> 100% of the user operations will hurt the raidz3 resilver.

Good argument for SSDs, yes? :-)

> 6. Please see the following calculation of probability of failure of 20
> mirrors vs a 23-disk raidz3. According to my calculations, the probability
> of 4 disk failures in the raidz3 is approx 4.4E-4 and the probability of 2
> disks in the same mirror failing is approx 5E-5.
> http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf

Ok, you've shared the math and it isn't quite right. To build a better model, you will need to work on the probability of each sector being corrupt and where those corrupt sectors lie. What we tend to see in the field is that the probability of failure follows the models from the vendors, and the locality follows more traditional location models. Location models for HDDs are not easy, because there are so many layers of reordering, caching, and optimization. IMHO it is better to rely on empirical studies, which I have done. My data does not match your model very well. Do you have some measurements to back up your hypothesis?

> Compare the 21-disk raidz3 versus 3 vdevs of 7-disk raidz1. You get more
> than 3x faster resilver time with the smaller vdevs, and you only get 3x
> the redundancy in the raidz3. That means the probability of 4
> simultaneously failed disks in the raidz3 is higher than the probability of
> 2 failed disks in a single raidz1 vdev.

Disagree. We do have models for this and can do the math.
Starting with the model I described in "ZFS data protection comparison" and extending to 21 disks, we see:

    Config              MTTDL[1] (years)
    3x 7-disk raidz1               2,581
    21-disk raidz3            37,499,659

As I've said many times, and shown data to prove (next chance is at the OpenStorage Summit in a few weeks :-), the resilver becomes constrained by the performance of the resilvering disk, not the surviving disks.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
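For readers who want to see where numbers of this kind come from, here is a minimal sketch of the commonly cited MTTDL[1] approximation. The disk MTTF and resilver time below are placeholder assumptions, so the output will not reproduce the table above, which depends on the inputs used in the original comparison:

    from functools import reduce

    # Placeholder assumptions; not the inputs behind the table above.
    MTTF_H = 1_000_000   # assumed disk MTTF, hours
    MTTR_H = 168         # assumed resilver (repair) time, hours

    def mttdl1_vdev(n_disks, parity):
        """MTTDL[1] approximation for one raidz vdev:
        MTTF^(P+1) / (N * (N-1) * ... * (N-P) * MTTR^P)."""
        denom = reduce(lambda acc, k: acc * (n_disks - k), range(parity + 1), 1)
        return MTTF_H ** (parity + 1) / (denom * MTTR_H ** parity)

    def mttdl1_pool(vdev_count, n_disks, parity):
        # Independent top-level vdevs: pool MTTDL ~ vdev MTTDL / vdev count.
        return mttdl1_vdev(n_disks, parity) / vdev_count

    HOURS_PER_YEAR = 24 * 365
    for label, vdevs, n, p in [("3x 7-disk raidz1", 3, 7, 1),
                               ("21-disk raidz3",   1, 21, 3)]:
        years = mttdl1_pool(vdevs, n, p) / HOURS_PER_YEAR
        print(f"{label:18s} MTTDL[1] ~ {years:,.0f} years")

Whatever the absolute numbers, the structure shows why MTTDL grows so quickly with parity count: each additional parity device multiplies the result by roughly another factor of MTTF/MTTR.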
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> > http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg41998.html
>
> Slabs don't matter. So the rest of this argument is moot.

Tell it to Erik. He might want to know. Or maybe he knows better than you.

> > 2. Each slab is spread across many disks, so the average seek time to fetch
> > the slab approaches the maximum seek time of a single disk. That means an
> > average 2x longer than average seek time.
>
> nope.

Anything intelligent to add? Or just "nope"?

> Seeks are usually quite small compared to the rotational delay, due to
> the way data is written.

I'm using the term "seek time" to refer to the time from when the drive receives an instruction to when it is actually able to read/write the requested data. In drive spec sheets this is often referred to as "seek time," so I don't think I'm misusing the term, and it includes the rotational delay.

> > 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> > they're trying to resilver at the same time. Does the system ignore
> > subsequently failed disks and concentrate on restoring a single disk
> > quickly?
>
> No, of course.
>
> > Or does the system try to resilver them all simultaneously and
> > therefore double or triple the time before any one disk is fully resilvered?
>
> Yes, of course.

Are those supposed to be real answers? Or are you mocking me? It sounds like mocking.

If you don't mind, please try to stick with productive conversation. I'm just skipping the rest of your reply from here down, because I'm considering it hostile and unnecessary to read or reply further.
On Oct 18, 2010, at 6:52 AM, Edward Ned Harvey wrote:

>> Slabs don't matter. So the rest of this argument is moot.
>
> Tell it to Erik. He might want to know. Or maybe he knows better than you.

You were the one who posted this. If you intend to follow citations, then there are quite a number of useful discussions on resilvering in the 2007-2008 archives.

>> nope.
>
> Anything intelligent to add? Or just "nope"?

The assertion that the average seek is 2x longer than an average seek time is wrong. This is all done in parallel, not serially, so there is no 2x penalty.

> I'm using the term "seek time" to refer to the time from when the drive
> receives an instruction to when it is actually able to read/write the
> requested data. In drive spec sheets this is often referred to as "seek
> time," so I don't think I'm misusing the term, and it includes the
> rotational delay.

It is important because you have concentrated your concern on seek time. Even if the seek time were zero, you can't get past the rotational delay on HDDs. For reads, which are what we are concerned about here, the likelihood of the data existing in the track cache is high, so the penalty of a blown rev is low.

> Are those supposed to be real answers? Or are you mocking me? It sounds
> like mocking.
>
> If you don't mind, please try to stick with productive conversation. I'm
> just skipping the rest of your reply from here down, because I'm considering
> it hostile and unnecessary to read or reply further.

If you want to recommend configurations and compare or contrast their merits, then you should be able to defend your decisions. In engineering this would be known as a critical design review, where the operational definition of "critical" is "expressing or involving an analysis of the merits and faults of a work product", incorporating detailed and scholarly analysis and commentary. While people who are not experienced with critical design reviews may view them as hostile, the desire to achieve a better product or result is the ultimate goal. Check your ego at the door.
 -- richard
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> they're trying to resilver at the same time. Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly? Or does the system try to resilver them all simultaneously and
> therefore double or triple the time before any one disk is fully
> resilvered?

This is a legitimate question. If anyone knows, I'd like to know...
On Wed, Oct 20, 2010 at 4:05 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:

>> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
>> they're trying to resilver at the same time. Does the system ignore
>> subsequently failed disks and concentrate on restoring a single disk
>> quickly? Or does the system try to resilver them all simultaneously and
>> therefore double or triple the time before any one disk is fully
>> resilvered?
>
> This is a legitimate question. If anyone knows, I'd like to know...

My recent experience with os_111b, os_134 and oi_147 was that a subsequent failure and disk replacement causes the resilver to restart from the beginning, including the new disks on the later pass. If the disk is not replaced, the resilver runs to completion (and then a replace can be performed with a new resilver). This is, however, an area that is still being developed, so changes may be coming.

--
- Tuomas
Since my name was mentioned, a couple of things:

(a) I'm not infallible. :-)

(b) In my posts, I swapped "slab" for "record". I really should have said "record"; it's more correct as to what's going on.

(c) It is possible for constituent drives in a raidz to be issued concurrent requests for portions of a record, which *may* increase efficiency. So the "assembly" of a complete record isn't a completely serial operation (that is, ZFS doesn't wait for all the parts of a record to be assembled before issuing further requests for the next record). Drives may therefore have requests for multiple portions of records sitting in their "todo" queues. Thus, all "good" (i.e. being rebuilt *from*) drives should be constantly busy, not waiting around for others to finish reading data. That all said, I don't see where in the code it is indicated how many records can be done in parallel. 2? 4? 20? It matters quite a bit.

(d) Writing completed record parts (i.e. the segments that need to be resilvered) is also queued up, so, for the most part, the replaced drive is doing relatively sequential I/O. That is, *usually* the head doesn't have to seek and *may* not even have to wait much for rotational delay; it just stays where it left off and writes the next reconstructed data. Now, for drives which are not replaced, but rather just "stale", this often isn't true, and those drives may be stuck seeking quite a bit. But since they're usually only slightly stale, it isn't noticed that much.

(e) Given (c) above, the average performance of a drive being read does tend to be "average" for random I/O, that is, roughly half the maximum seek time plus half a revolution of rotational latency. NCQ etc. will help this by clustering reads, so actual performance should be better than a pure average, but I'd not bet on a significant improvement. And for a typical pool, I'm going to make a bald-faced statement that the HD read cache is going to be much less helpful than usual (for a typical filesystem with lots of small files, most will fit in a single record, and the next location on the HD is likely NOT to be something you want); that is, HD read-ahead cache misses are going to be frequent. All this assumes you are reconstructing a drive in a pool which has not been written sequentially; those kinds of zpools will resilver much faster than zpools exposed to "typical" read/write patterns.

(f) IOPS is going to be the limiting factor, particularly for the resilvering drive, as there is less opportunity to group writes than there is to group reads (even allowing for (d) above). My reading of the code says that ZFS issues writes to the resilvering drive as the opportunity arises; that is, ZFS itself doesn't try to batch up multiple records into a single write request. I'd like verification of this, though.

-Erik

--
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
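To get a feel for why point (f) tends to dominate, here is a rough back-of-the-envelope sketch. Every input (amount of allocated data, average record size, sustained write IOPS) is a hypothetical placeholder rather than a measurement of any particular pool:

    # Rough resilver-time estimate when the resilvering disk's per-I/O write
    # rate is the bottleneck.  All inputs are hypothetical placeholders.
    allocated_gb  = 500    # data to reconstruct onto the replacement disk
    avg_record_kb = 64     # average record/fragment written per I/O
    write_iops    = 150    # sustained small-write IOPS of the replacement disk

    writes_needed = allocated_gb * 1024 * 1024 / avg_record_kb
    hours = writes_needed / write_iops / 3600
    print(f"~{writes_needed:,.0f} writes, ~{hours:.1f} hours at {write_iops} IOPS")

Larger average records (as in mostly sequentially written pools) cut the I/O count, and therefore the time, roughly proportionally, which is consistent with points (d) through (f) above: record size and write batching, not raw bandwidth, set the pace.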