I'm setting up a server with 20x1TB disks. Initially I had thought to set up the disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks. So I'm thinking a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data instead of 16TB. What are your suggestions and experiences? -- This message posted from opensolaris.org
Hi

On Monday 06 September 2010 17:53:44 hatish wrote:
> I'm setting up a server with 20x1TB disks. Initially I had thought to set up the disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks. So I'm thinking a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data instead of 16TB.
>
> What are your suggestions and experiences?

Another consideration is that all vdevs in one pool should be equal, i.e. not mixed like 2x7 and 1x6 (you will most likely need to force that configuration anyway). First, I'd assess what you want/expect from this file system in the end: maximum performance, maximum reliability or maximum size - as always, pick two ;)

Cheers, Carsten
Can you add another disk? Then you would have three 7-disk vdevs. (Always use raidz2.) -- This message posted from opensolaris.org
Otherwise you could keep 2 disks as hot spares and build three 6-disk vdevs. -- This message posted from opensolaris.org
On Mon, Sep 6, 2010 at 8:53 AM, hatish <hatish at gmail.com> wrote:
> I'm setting up a server with 20x1TB disks. Initially I had thought to set up the disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks. So I'm thinking a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data instead of 16TB.

2 x 10-disk raidz2 should be fine for general storage. It depends on what your performance needs are.

Or go with 3 x 6-disk vdevs, a spare and an L2ARC.

-B -- Brandon High : bhigh at freaks.com
----- Original Message -----
> On Mon, Sep 6, 2010 at 8:53 AM, hatish <hatish at gmail.com> wrote:
> > I'm setting up a server with 20x1TB disks. Initially I had thought to
> > set up the disks using 2 RaidZ2 groups of 10 disks. However, I have
> > just read the Best Practices guide, and it says your group shouldn't
> > have > 9 disks. So I'm thinking a better configuration would be 2 x
> > 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data
> > instead of 16TB.
>
> 2 x 10-disk raidz2 should be fine for general storage. It depends on
> what your performance needs are.
>
> Or go with 3 x 6-disk vdevs, a spare and an L2ARC.

A 7k2 drive for L2ARC?

Vennlige hilsener / Best regards

roy -- Roy Sigurd Karlsbakk (+47) 97542685 roy at karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
On Mon, Sep 6, 2010 at 2:36 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> A 7k2 drive for L2ARC?

It wouldn't be great, but you could put an SSD in the bay instead. -B -- Brandon High : bhigh at freaks.com
Thanks for all the replies :)

My mindset is split in two now...

Some detail - I'm using 4 1-to-5 SATA port multipliers connected to a 4-port SATA RAID card.

I only need reliability and size; as long as my performance is the equivalent of one drive, I'm happy.

I'm assuming all the data used in the group is read once when re-creating a lost drive. Also assuming space consumed is 50%.

So option 1 - Stay with the 2 x 10-drive RaidZ2. My concern is the stress on the drives when one drive fails and the others go crazy (read-wise) to re-create the new drive. Is there no way to reduce this stress? Maybe limit the data rate, so it's not quite so stressful, even though it will end up taking longer? (quite acceptable) [Available Space: 16TB, Redundancy Space: 4TB, Repair data read: 4.5TB]

And option 2 - Add a 21st drive to one of the motherboard SATA ports, and then go with 3 x 7-drive RaidZ2. [Available Space: 15TB, Redundancy Space: 6TB, Repair data read: 3TB]

Sadly, SSDs won't go too well in a PM-based setup like mine. I may add one directly onto the MB if I can afford it. But again, performance is not a priority.

Any further thoughts and ideas are much appreciated. -- This message posted from opensolaris.org
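As a rough sketch of the space figures being compared in this thread (assuming 1 TB per disk and ignoring ZFS metadata/slop overhead - a simplification, not exact numbers):

# Rough capacity comparison for the layouts discussed in this thread.
# Assumes 1 TB per disk and ignores ZFS metadata/slop overhead.

def layout_space(vdevs, disks_per_vdev, parity, disk_tb=1):
    """Return (usable TB, parity TB) for identical raidz vdevs."""
    data_disks = disks_per_vdev - parity
    return vdevs * data_disks * disk_tb, vdevs * parity * disk_tb

print("2 x 10-disk raidz2 :", layout_space(2, 10, 2))  # (16, 4)
print("3 x 7-disk raidz2  :", layout_space(3, 7, 2))   # (15, 6)

# The mixed 2x7 + 1x6 raidz2 layout from the original post:
u1, p1 = layout_space(2, 7, 2)
u2, p2 = layout_space(1, 6, 2)
print("2x7 + 1x6 raidz2   :", (u1 + u2, p1 + p2))      # (14, 6)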
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of hatish
>
> I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks.

I think the value you can take from this is: Why does the BPG say that? What is the reasoning behind it?

Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)
Makes sense. My understanding is not good enough to confidently make my own decisions, and I'm learning as I go.

The BPG says: - The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups.

If there was a reason leading up to this statement, I didn't follow it. However, a few paragraphs later, their RaidZ2 example says [4x(9+2), 2 hot spares, 18.0 TB]. So I guess 8+2 should be quite acceptable, especially since performance is the lowest priority.

On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of hatish
> >
> > I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks.
>
> I think the value you can take from this is:
> Why does the BPG say that? What is the reasoning behind it?
>
> Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)
Maybe 5x(3+1) - use one disk from each controller; 15TB usable space, and 3+1 raidz rebuild time should be reasonable.

On 9/7/2010 4:40 AM, hatish wrote:
> Thanks for all the replies :)
>
> My mindset is split in two now...
>
> Some detail - I'm using 4 1-to-5 SATA port multipliers connected to a 4-port SATA RAID card.
>
> I only need reliability and size; as long as my performance is the equivalent of one drive, I'm happy.
>
> I'm assuming all the data used in the group is read once when re-creating a lost drive. Also assuming space consumed is 50%.
>
> So option 1 - Stay with the 2 x 10-drive RaidZ2. My concern is the stress on the drives when one drive fails and the others go crazy (read-wise) to re-create the new drive. Is there no way to reduce this stress? Maybe limit the data rate, so it's not quite so stressful, even though it will end up taking longer? (quite acceptable) [Available Space: 16TB, Redundancy Space: 4TB, Repair data read: 4.5TB]
>
> And option 2 - Add a 21st drive to one of the motherboard SATA ports, and then go with 3 x 7-drive RaidZ2. [Available Space: 15TB, Redundancy Space: 6TB, Repair data read: 3TB]
>
> Sadly, SSDs won't go too well in a PM-based setup like mine. I may add one directly onto the MB if I can afford it. But again, performance is not a priority.
>
> Any further thoughts and ideas are much appreciated.
> On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>
> I think the value you can take from this is:
> Why does the BPG say that? What is the reasoning behind it?
>
> Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)

Let's examine the myth that you should limit the number of drives in a vdev because of resilver time. The myth goes something like this: You shouldn't use more than ___ drives in a vdev raidz_ configuration, because all the drives need to read during a resilver, so the more drives are present, the longer the resilver time.

The truth of the matter is: Only the size of used data is read. Because this is ZFS, it's smarter than a hardware solution which would have to read all disks in their entirety. In ZFS, if you have a 6-disk raidz1 with the capacity of 5 disks, and a total of 50G of data, then each disk has roughly 10G of data on it. During resilver, 5 disks will each read 10G of data, and 10G of data will be written to the new disk. If you have an 11-disk raidz1 with the capacity of 10 disks, then each disk has roughly 5G of data. 10 disks will each read 5G of data, and 5G of data will be written to the new disk. If anything, more disks means a faster resilver, because you're more easily able to saturate the bus, and you have a smaller amount of data that needs to be written to the replaced disk.

Let's examine the myth that you should limit the number of disks for the sake of redundancy. It is true that a carefully crafted system can survive things like SCSI controller or tray failure. Suppose you have 3 SCSI cards. Suppose you construct a raidz2 device using 2 disks from controller 0, 2 disks from controller 1, and 2 disks from controller 2. Then if a controller dies, you have only lost 2 disks, and you are degraded but still functional as long as you don't lose another disk. But you said you have 20 disks all connected to a single controller. So none of that matters in your case.

Personally, I can't imagine any good reason to generalize "don't use more than ___ devices in a vdev." To me, a 12-disk raidz2 is just as likely to fail as a 6-disk raidz1. But a 12-disk raidz2 is slightly more reliable than having two 6-disk raidz1's. Perhaps, maybe, a 64-bit processor is able to calculate parity on an 8-disk raidz set in a single operation, but requires additional operations to calculate parity if your raidz has 9 or more disks in it ... But I am highly skeptical of this line of reasoning, and AFAIK, nobody has ever suggested this before me. I made it up just now. I'm grasping at straws and stretching my imagination to find *any* merit in the statement, "don't use more than ___ disks in a vdev." I see no reasoning behind it, and unless somebody can say anything to support it, I think it's bunk.
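A minimal sketch of the arithmetic in that argument (assuming data is spread evenly across the vdev and that only allocated blocks are read during resilver):

# Per-disk resilver traffic for the two examples above, assuming data is
# spread evenly across the vdev and only used blocks are resilvered.

def resilver_traffic(total_data_gb, disks, parity):
    """Approximate GB on each disk (read per survivor, written to the new
    disk) and the number of surviving disks that must read."""
    data_disks = disks - parity
    per_disk_gb = total_data_gb / data_disks
    surviving_readers = disks - 1
    return per_disk_gb, surviving_readers

# 6-disk raidz1 holding 50 GB of data: ~10 GB per disk, 5 readers
print(resilver_traffic(50, 6, 1))   # (10.0, 5)

# 11-disk raidz1 holding the same 50 GB: ~5 GB per disk, 10 readers
print(resilver_traffic(50, 11, 1))  # (5.0, 10)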
On Wed, Sep 8, 2010 at 06:59, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>>
>> I think the value you can take from this is:
>> Why does the BPG say that? What is the reasoning behind it?
>>
>> Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)
>
> Let's examine the myth that you should limit the number of drives in a vdev because of resilver time. The myth goes something like this: You shouldn't use more than ___ drives in a vdev raidz_ configuration, because all the drives need to read during a resilver, so the more drives are present, the longer the resilver time.
>
> The truth of the matter is: Only the size of used data is read. Because this is ZFS, it's smarter than a hardware solution which would have to read all disks in their entirety. In ZFS, if you have a 6-disk raidz1 with the capacity of 5 disks, and a total of 50G of data, then each disk has roughly 10G of data on it. During resilver, 5 disks will each read 10G of data, and 10G of data will be written to the new disk. If you have an 11-disk raidz1 with the capacity of 10 disks, then each disk has roughly 5G of data. 10 disks will each read 5G of data, and 5G of data will be written to the new disk. If anything, more disks means a faster resilver, because you're more easily able to saturate the bus, and you have a smaller amount of data that needs to be written to the replaced disk.

It is not a question of a vdev with 6 disks vs a vdev with 12 disks. It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2 vdevs you only have to read half the data compared to 1 vdev to resilver a disk. Or look at it this way: you will put more data on a 12-disk vdev than on a 6-disk vdev. I/O other than the resilver will also slow the resilver down more if you have large vdevs.
Rebuild time is not a concern for me. The concern with rebuilding was the stress it puts on the disks for an extended period of time (increasing the chances of another disk failure). The % of data used doesn't matter, as the system will try to get it done at max speed, thus creating the mentioned stress. But I suspect the port multipliers will do a good job of throttling the I/O such that the disks face minimal stress. Thus I'm pretty sure I'll stick with 2 x 10-disk RaidZ2. Thanks for all the input! -- This message posted from opensolaris.org
> From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of Mattias Pantzare
>
> It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
> vdevs you have to read half the data compared to 1 vdev to resilver a
> disk.

Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G.

Let's suppose you have 1T of data. You have 2 vdevs that are each a 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G.

Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.

In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.)

In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ...

Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
Mattias, what you say makes a lot of sense. When I saw *Both of the above situations resilver in equal time*, I was like "no way!" But like you said, assuming no bus bottlenecks.

This is my exact breakdown (cheap disks on a cheap bus :P) :

PCI-E 8X 4-port ESata RAID controller.
4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
20 x Samsung 1TB HDDs (each connected to a port multiplier).

The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6 Tb at 3Gbps
= 57600 Gb at 3Gbps
= 19200 seconds
= 320 minutes
= 5 hours 20 minutes.

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 disks per PM] x 2, [3 disks per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.

I've never actually sat and done the math before. Hope it's decently accurate :)

On Wed, Sep 8, 2010 at 3:27 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> > From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of
> > Mattias Pantzare
> >
> > It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
> > vdevs you have to read half the data compared to 1 vdev to resilver a
> > disk.
>
> Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G.
>
> Let's suppose you have 1T of data. You have 2 vdevs that are each a 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G.
>
> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.
>
> In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.)
>
> In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ...
>
> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
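A small sketch of the bandwidth-limited estimate in the post above (assuming a 10-disk vdev with drives 80% full, so 9 surviving disks of 800 GB each must be read, throttled to the 3 Gbps figure used in the post):

# Bandwidth-limited resilver estimate from the post above.
# Assumptions: 10-disk raidz2 vdev, drives 80% full (800 GB each),
# 9 surviving disks must be read, and reads are limited to 3 Gbps.

read_gb   = 800 * 9          # GB that must be read from surviving disks
read_gbit = read_gb * 8      # convert to gigabits
rate_gbps = 3                # effective read bandwidth assumed in the post

seconds = read_gbit / rate_gbps
print(f"{read_gb / 1000:.1f} TB read -> {seconds:.0f} s = {seconds / 3600:.1f} hours")
# 7.2 TB read -> 19200 s = 5.3 hours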
On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of
>> Mattias Pantzare
>>
>> It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
>> vdevs you have to read half the data compared to 1 vdev to resilver a
>> disk.
>
> Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G.
>
> Let's suppose you have 1T of data. You have 2 vdevs that are each a 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G.
>
> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.
>
> In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.)
>
> In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ...
>
> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."

This is assuming that you have no other I/O besides the scrub.

You should of course keep the number of disks in a vdev low for general performance reasons unless you only have linear reads (as your IOPS will be close to what only one disk can give for the whole vdev).
On Wed, Sep 8, 2010 at 6:27 AM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.

No, it (a 21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all.

My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died.

No problem, it's ZFS, it's meant to be easy to replace a drive: just offline, swap, replace, wait for it to resilver.

Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously, and the box was basically unusable (5 minutes to get the password line to appear on the console).

Tried rebooting a few times, stopped all disk I/O to the machine (it was our backups box, running rsync every night for - at the time - 50+ remote servers), let it do its thing.

After 3 weeks of trying to get the resilver to complete (or even reach 50%), we pulled the plug and destroyed the pool, rebuilding it using 3x 8-drive raidz2 vdevs. Things have been a lot smoother ever since. Have replaced 8 of the drives (1 vdev) with 1.5 TB drives. Have replaced multiple dead drives. Resilvers, while running outgoing rsync all day and incoming rsync all night, take 3 days for a 1.5 TB drive (with SNMP showing 300 MB/s disk I/O).

You most definitely do not want to use a single super-wide raidz vdev. It just won't work.

> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."

Yeah, I still don't buy it. Even spreading disks out such that you have 4 SATA drives per PCI-X/PCIe bus, I don't think you'd be able to get a 500 GB SATA disk to resilver in a 24-disk raidz vdev (even a raidz1) in a 50% full pool. Especially if you are using the pool for anything at the same time. -- Freddie Cash fjwcash at gmail.com
On 9/8/2010 10:08 PM, Freddie Cash wrote:
> On Wed, Sep 8, 2010 at 6:27 AM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.
>
> No, it (a 21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all.
>
> My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died.
>
> No problem, it's ZFS, it's meant to be easy to replace a drive: just offline, swap, replace, wait for it to resilver.
>
> Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously, and the box was basically unusable (5 minutes to get the password line to appear on the console).
>
> Tried rebooting a few times, stopped all disk I/O to the machine (it was our backups box, running rsync every night for - at the time - 50+ remote servers), let it do its thing.
>
> After 3 weeks of trying to get the resilver to complete (or even reach 50%), we pulled the plug and destroyed the pool, rebuilding it using 3x 8-drive raidz2 vdevs. Things have been a lot smoother ever since. Have replaced 8 of the drives (1 vdev) with 1.5 TB drives. Have replaced multiple dead drives. Resilvers, while running outgoing rsync all day and incoming rsync all night, take 3 days for a 1.5 TB drive (with SNMP showing 300 MB/s disk I/O).
>
> You most definitely do not want to use a single super-wide raidz vdev. It just won't work.
>
>> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
>
> Yeah, I still don't buy it. Even spreading disks out such that you have 4 SATA drives per PCI-X/PCIe bus, I don't think you'd be able to get a 500 GB SATA disk to resilver in a 24-disk raidz vdev (even a raidz1) in a 50% full pool. Especially if you are using the pool for anything at the same time.

The thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction.

This is *regardless* of the amount of data being reconstructed.

So, the bottleneck tends to be the IOPS value of the single disk being reconstructed. Thus, having fewer disks in a vdev leads to less data being required to be resilvered, which leads to fewer IOPS being required to finish the resilver.

Example (for ease of calculation, let's do the disk-drive mfg's cheat of 1k = 1000 bytes):

Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4x32k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.

Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.

From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit.

In addition, remember that having more disks means you have to wait longer for each IOPS to complete. That is, it takes longer (fractionally, but in the aggregate, a measurable amount) for 9 drives to each return 14k of info than it does for 4 drives to return 32k of data. This is due to rotational and seek access delays. So, not only are you having to do more total IOPS in Scenario 2, but each IOPS takes longer to complete (the read cycle taking longer, the write/reconstruct cycle taking the same amount of time).

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
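A small sketch of that IOPS arithmetic (assuming a 128k record split evenly across the data disks, one I/O per record on the rebuilding disk, and the 1k = 1000 bytes convention used in the post):

# I/Os needed to rebuild a 1 TB drive in a raidz1, following the reasoning
# above. Assumes 128k records split evenly across data disks and one I/O
# per record on the rebuilding disk (1k = 1000 bytes, as in the post).

def rebuild_ios(disks, parity=1, record_kb=128, drive_tb=1):
    data_disks = disks - parity
    per_disk_kb = record_kb / data_disks   # portion of each record per disk
    drive_kb = drive_tb * 1e9              # 1 TB in KB (vendor math)
    return per_disk_kb, drive_kb / per_disk_kb

print(rebuild_ios(5))    # ~32 KB per I/O  -> ~31e6 I/Os
print(rebuild_ios(10))   # ~14 KB per I/O  -> ~70e6 I/Os (the post's ~71e6)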
Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kB being written to each drive. That doesn't sound very efficient to me.

(When I say the above, I mean a five-disk raidz or a ten-disk raidz2, etc.)

Cheers,

On 9 September 2010 18:58, Erik Trimble <erik.trimble at oracle.com> wrote:
> The thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction.
>
> This is *regardless* of the amount of data being reconstructed.
>
> So, the bottleneck tends to be the IOPS value of the single disk being reconstructed. Thus, having fewer disks in a vdev leads to less data being required to be resilvered, which leads to fewer IOPS being required to finish the resilver.
>
> Example (for ease of calculation, let's do the disk-drive mfg's cheat of 1k = 1000 bytes):
>
> Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4x32k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.
>
> Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.
>
> From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit.
>
> In addition, remember that having more disks means you have to wait longer for each IOPS to complete. That is, it takes longer (fractionally, but in the aggregate, a measurable amount) for 9 drives to each return 14k of info than it does for 4 drives to return 32k of data. This is due to rotational and seek access delays. So, not only are you having to do more total IOPS in Scenario 2, but each IOPS takes longer to complete (the read cycle taking longer, the write/reconstruct cycle taking the same amount of time).
>
> -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
On 9/9/2010 2:15 AM, taemun wrote:
> Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kB being written to each drive. That doesn't sound very efficient to me.
>
> (When I say the above, I mean a five-disk raidz or a ten-disk raidz2, etc.)
>
> Cheers,

Well, since the size of a slab can vary (from 512 bytes to 128k), it's hard to say. Length (size) of the slab is likely the better determination. Remember, each block on a hard drive is 512 bytes (for now). So, it's really not any more efficient to write 16k than 14k (or vice versa). Both are integer multiples of 512 bytes.

IIRC, there was something about using a power-of-two number of data drives in a RAIDZ, but I can't remember what that was. It may just be a phantom memory.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Very interesting... Well, let's see if we can do the numbers for my setup.

From a previous post of mine:

[i]This is my exact breakdown (cheap disks on a cheap bus :P) :

PCI-E 8X 4-port ESata RAID controller.
4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
20 x Samsung 1TB HDDs (each connected to a port multiplier).

The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6 Tb at 3Gbps
= 57600 Gb at 3Gbps
= 19200 seconds
= 320 minutes
= 5 hours 20 minutes.

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks & 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.

I've never actually sat and done the math before. Hope it's decently accurate :)[/i]

My scenario, as from Erik's post: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.

Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb. There will be no other I/O while a rebuild is going.
Best case: I'll read at 12Gbps & write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB. -- This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Freddie Cash
>
> No, it (a 21-disk raidz3 vdev) most certainly will not resilver in the
> same amount of time. In fact, I highly doubt it would resilver at all.
>
> My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB
> Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE
> multilane controllers. Nice 10 TB storage pool. Worked beautifully as
> we filled it with data. Had less than 50% usage when a disk died.
>
> No problem, it's ZFS, it's meant to be easy to replace a drive: just
> offline, swap, replace, wait for it to resilver.
>
> Well, 3 days later, it was still under 10%, and every disk light was
> still solid green. SNMP showed over 100 MB/s of disk I/O continuously,

I don't believe your situation is typical. I think you either encountered a bug, or you had something happening that you weren't aware of (scrub, autosnapshots, etc) ... because the only time I've ever seen anything remotely similar to the behavior you described was the bug I've mentioned in other emails, which occurs when the disk is 100% full and a scrub is taking place. I know it's not the same bug for you, because you said your pool was only 50% full. But I don't believe that what you saw was normal or typical.
On 9/9/2010 5:49 AM, hatish wrote:
> Very interesting...
>
> Well, let's see if we can do the numbers for my setup.
>
> From a previous post of mine:
>
> [i]This is my exact breakdown (cheap disks on a cheap bus :P) :
>
> PCI-E 8X 4-port ESata RAID controller.
> 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
> 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>
> The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.
>
> So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
> Best case scenario, we can read 7.2TB at 3Gbps
> = 57.6 Tb at 3Gbps
> = 57600 Gb at 3Gbps
> = 19200 seconds
> = 320 minutes
> = 5 hours 20 minutes.
>
> Even if it takes twice that amount of time, I'm happy.
>
> Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks & 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.
>
> I've never actually sat and done the math before. Hope it's decently accurate :)[/i]
>
> My scenario, as from Erik's post: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.
>
> Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb. There will be no other I/O while a rebuild is going.
> Best case: I'll read at 12Gbps & write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB.

Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.

So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
On Thu, Sep 9, 2010 at 09:03, Erik Trimble <erik.trimble at oracle.com> wrote:
> Actually, your biggest bottleneck will be the IOPS limits of the drives. A
> 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>
> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
> IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is
> over 173 hours. Or, about 7.25 WEEKS.

No argument on IOPS, but 173 hours is 7 days, or a little over one week.

Will
Ahhhh, I see. But I think your math is a bit out: 62.5e6 I/Os @ 100 IOPS = 625000 seconds = 10416m = 173h = 7D6h. So 7 days & 6 hours.

That's long, but I can live with it. This isn't for an enterprise environment. While the length of time is a worry in terms of increasing the chance another drive will fail, in my mind that is mitigated by the fact that the drives won't be under major stress during that time. It's a workable solution.

On Thu, Sep 9, 2010 at 3:03 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
> On 9/9/2010 5:49 AM, hatish wrote:
>> Very interesting...
>>
>> Well, let's see if we can do the numbers for my setup.
>>
>> From a previous post of mine:
>>
>> [i]This is my exact breakdown (cheap disks on a cheap bus :P) :
>>
>> PCI-E 8X 4-port ESata RAID controller.
>> 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
>> 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>>
>> The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.
>>
>> So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
>> Best case scenario, we can read 7.2TB at 3Gbps
>> = 57.6 Tb at 3Gbps
>> = 57600 Gb at 3Gbps
>> = 19200 seconds
>> = 320 minutes
>> = 5 hours 20 minutes.
>>
>> Even if it takes twice that amount of time, I'm happy.
>>
>> Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks & 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.
>>
>> I've never actually sat and done the math before. Hope it's decently accurate :)[/i]
>>
>> My scenario, as from Erik's post: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.
>>
>> Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb. There will be no other I/O while a rebuild is going.
>> Best case: I'll read at 12Gbps & write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB.
>
> Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>
> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS.
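A quick check of that figure (assuming ~100 IOPS on the rebuilding 7200 RPM drive and 16 KB of data reconstructed per I/O, as in the 10-disk raidz2 scenario above):

# IOPS-limited resilver estimate for the 10-disk raidz2 scenario above.
# Assumes ~100 IOPS on the rebuilding 7200 RPM drive and 16 KB reconstructed
# per I/O (1 TB drive, vendor 1k = 1000 bytes math).

total_ios = 1e12 / 16e3      # ~62.5e6 I/Os to rebuild the whole drive
iops      = 100

seconds = total_ios / iops
print(f"{seconds:.0f} s = {seconds / 3600:.0f} h = {seconds / 86400:.1f} days")
# 625000 s = 174 h = 7.2 days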
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Erik Trimble
>
> the thing that folks tend to forget is that RaidZ is IOPS limited. For
> the most part, if I want to reconstruct a single slab (stripe) of data,
> I have to issue a read to EACH disk in the vdev, and wait for that disk
> to return the value, before I can write the computed parity value out to
> the disk under reconstruction.

If I'm trying to interpret your whole message, Erik, and condense it, I think I get the following. Please tell me if and where I'm wrong.

In any given zpool, some number of slabs are used in the whole pool. In raidzN, a portion of each slab is written on each disk. Therefore, during resilver, if there are a total of 1 million slabs used in the zpool, it means each good disk will need to read 1 million partial slabs, and the replaced disk will need to write 1 million partial slabs. Each good disk receives a read request in parallel, and all of them must complete before a write is given to the new disk. Each read/write cycle is completed before the next cycle begins. (It seems this could be accelerated by allowing all the good disks to continue reading in parallel instead of waiting, right?)

The conclusion I would reach is:

Given no bus bottleneck:

It is true that resilvering a raidz will be slower with many disks in the vdev, because the average latency for the worst of N disks will increase as N increases. But that effect is only marginal, and bounded between the average latency of a single disk and the worst-case latency of a single disk.

The characteristic that *really* makes a big difference is the number of slabs in the pool, i.e. whether your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files.
> From: Hatish Narotam [mailto:hatish at gmail.com]
>
> PCI-E 8X 4-port ESata RAID controller.
> 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on
> the controller).
> 20 x Samsung 1TB HDDs (each connected to a port multiplier).

Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm SATA disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration.

You think that your SATA card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your SATA controller, but please prove me wrong. ;-) I think the backplane of the SATA controller is more likely either 3G or 6G.

If it's 3G, then you should use 4 groups of raidz1.
If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit).
If it's 12G or higher, then you can make all of your drives one big vdev of raidz3.

> According to Samsung's site, max read speed is 250MBps, which
> translates to 2Gbps. Multiply by 5 drives gives you 10Gbps.

I guarantee you this is not a sustainable speed for 7.2krpm SATA disks. You can get a decent measure of sustainable speed by doing something like:

(write 1G byte)
time dd if=/dev/zero of=/some/file bs=1024k count=1024
(beware: you might get an inaccurate speed measurement here due to ram buffering. See below.)

(reboot to ensure nothing is in cache)
(read 1G byte)
time dd if=/some/file of=/dev/null bs=1024k
(Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> The characteristic that *really* makes a big difference is the number of
> slabs in the pool, i.e. whether your filesystem is composed of mostly small
> files or fragments, versus mostly large unfragmented files.

Oh, if at least some of my reasoning was correct, there is one valuable take-away point for hatish:

Given some number X of total slabs used in the whole pool: If you use a single vdev for the whole pool, you will have X partial slabs written on each disk. If you have 2 vdevs, you'll have approx X/2 partial slabs written on each disk. 3 vdevs ~> X/3 partial slabs on each disk. Therefore, the resilver time approximately divides by the number of separate vdevs you are using in your pool.

So the largest factor affecting resilver time of a single large vdev versus many smaller vdevs is NOT the quantity of data written on each disk, but just the fact that fewer slabs are used on each disk when using smaller vdevs.

If you want to choose between (a) a 21-disk raidz3 versus (b) 3 vdevs of 7-disk raidz1 each, then: the raidz3 provides better redundancy, but has the disadvantage that every slab must be partially written on every disk.
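A minimal sketch of that take-away (assuming slabs are spread evenly across vdevs, one I/O per partial slab on the replaced disk, and the ~100 IOPS figure from earlier in the thread):

# With more vdevs, each disk carries fewer partial slabs, so the rebuilding
# disk needs fewer I/Os. Assumes slabs spread evenly across vdevs, one I/O
# per partial slab on the new disk, and ~100 IOPS on that disk.

def rebuild_estimate(total_slabs, vdevs, iops=100):
    ios = total_slabs / vdevs           # partial slabs on the replaced disk
    return ios, ios / iops / 3600       # (I/Os, hours at the IOPS limit)

for vdevs in (1, 2, 3):
    ios, hours = rebuild_estimate(1_000_000, vdevs)
    print(f"{vdevs} vdev(s): {ios:.0f} I/Os on the new disk, ~{hours:.1f} h")
# 1 vdev: ~2.8 h, 2 vdevs: ~1.4 h, 3 vdevs: ~0.9 h (for 1e6 slabs total)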
Hi,

*The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore 12Gbps limit on the controller.*

I was simply listing the bandwidth available at the different stages of the data cycle. The PCIe port gives me 32Gbps. The SATA card gives me a possible 12Gbps. I'd rather be cautious and assume I'll get more like 6Gbps - it is a cheap card after all.

*I guarantee you this is not a sustainable speed for 7.2krpm SATA disks.* (I am well aware :) )

*Which is 333% of the PM's capability.* Assuming that it is, 5 drives at that speed will max out my PM 3 times over. So my PM will automatically throttle the drives' speed to a third of that on account of the PM being maxed out.

Thanks for the rough I/O speed check :)

On Thu, Sep 9, 2010 at 3:20 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> > From: Hatish Narotam [mailto:hatish at gmail.com]
> >
> > PCI-E 8X 4-port ESata RAID controller.
> > 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on
> > the controller).
> > 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>
> Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm SATA disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration.
>
> You think that your SATA card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your SATA controller, but please prove me wrong. ;-) I think the backplane of the SATA controller is more likely either 3G or 6G.
>
> If it's 3G, then you should use 4 groups of raidz1.
> If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit).
> If it's 12G or higher, then you can make all of your drives one big vdev of raidz3.
>
> > According to Samsung's site, max read speed is 250MBps, which
> > translates to 2Gbps. Multiply by 5 drives gives you 10Gbps.
>
> I guarantee you this is not a sustainable speed for 7.2krpm SATA disks. You can get a decent measure of sustainable speed by doing something like:
>
> (write 1G byte)
> time dd if=/dev/zero of=/some/file bs=1024k count=1024
> (beware: you might get an inaccurate speed measurement here due to ram buffering. See below.)
>
> (reboot to ensure nothing is in cache)
> (read 1G byte)
> time dd if=/some/file of=/dev/null bs=1024k
> (Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
Erik wrote:
> Actually, your biggest bottleneck will be the IOPS limits of the drives.
> A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
> IOPS, that means you will finish (best case) in 62.5e4 seconds. Which
> is over 173 hours. Or, about 7.25 WEEKS.

My OCD is coming out and I will split that hair with you. 173 hours is just over a week.

This is a fascinating and timely discussion. My personal (biased and unhindered by facts) preference is wide-stripe RAIDZ3. Ned is right that I kept reading that RAIDZx should not exceed _ devices and couldn't find real numbers behind those conclusions. Discussions in this thread have opened my eyes a little, and I am in the middle of deploying a second 22-disk fibre array on my home server, so I have been struggling with the best way to allocate pools.

Up until reading this thread, the biggest downside to wide stripes that I was aware of has been low IOPS. And let's be clear: while on paper the IOPS of a wide stripe is the same as a single disk, it actually is worse. In truth, the service time for any request on a wide stripe is the service time of the SLOWEST disk for that request. The slowest disk may vary from request to request, but will always delay the entire stripe operation.

Since all of the 44 spindles are 15K disks, I am about to convince myself to go with two pools of wide stripes and keep several spindles for L2ARC and SLOG. The thinking is that other background operations (scrub and resilver) can take place with little impact on application performance, since those will be using L2ARC and SLOG.

Of course, I could be wrong on any of the above.

Cheers, Marty -- This message posted from opensolaris.org
On 9/9/2010 6:19 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Erik Trimble
>>
>> the thing that folks tend to forget is that RaidZ is IOPS limited. For
>> the most part, if I want to reconstruct a single slab (stripe) of data,
>> I have to issue a read to EACH disk in the vdev, and wait for that disk
>> to return the value, before I can write the computed parity value out to
>> the disk under reconstruction.
>
> If I'm trying to interpret your whole message, Erik, and condense it, I think I get the following. Please tell me if and where I'm wrong.
>
> In any given zpool, some number of slabs are used in the whole pool. In raidzN, a portion of each slab is written on each disk. Therefore, during resilver, if there are a total of 1 million slabs used in the zpool, it means each good disk will need to read 1 million partial slabs, and the replaced disk will need to write 1 million partial slabs. Each good disk receives a read request in parallel, and all of them must complete before a write is given to the new disk. Each read/write cycle is completed before the next cycle begins. (It seems this could be accelerated by allowing all the good disks to continue reading in parallel instead of waiting, right?)
>
> The conclusion I would reach is:
>
> Given no bus bottleneck:
>
> It is true that resilvering a raidz will be slower with many disks in the vdev, because the average latency for the worst of N disks will increase as N increases. But that effect is only marginal, and bounded between the average latency of a single disk and the worst-case latency of a single disk.
>
> The characteristic that *really* makes a big difference is the number of slabs in the pool, i.e. whether your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files.

Oh, and a mea culpa on converting hours to weeks instead of days. I did the math, then forgot which unit I was dealing in. Oops.

Your reading of my posts is correct. Indeed, the number of slabs is critical, as this directly impacts the IOPS needed.

One of the very nice speedups for resilvering would be the ability to do a larger "read" of several contiguous slabs (as physically laid out on the disks) in a single IOPS - the difference between reading a 128k slab portion and 5 consecutive 64k slab portions is trivial, so the ability to do more than one slab at a time would be critical for improving resilver times. I have *no* idea how hard this is - given that resilvering currently walks the space allocation tree (which is in creation-time order), it generally doesn't get good consecutive slab requests this way, so things would have to change from being tree-driven to being layout-on-disk-driven.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Comment at end...

Mattias Pantzare wrote:
> On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey <shill at nedharvey.com> wrote:
>>> From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of
>>> Mattias Pantzare
>>>
>>> It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
>>> vdevs you have to read half the data compared to 1 vdev to resilver a disk.
>>
>> Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have
>> approx 100G on each disk, and you replace one disk. Then 11 disks will each
>> read 100G, and the new disk will write 100G.
>>
>> Let's suppose you have 1T of data. You have 2 vdevs that are each 6-disk
>> raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx
>> 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk
>> will write 100G.
>>
>> Both of the above situations resilver in equal time, unless there is a bus
>> bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7
>> disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21
>> disks in a single raidz3 provides better redundancy than 3 vdevs each
>> containing a 7-disk raidz1.
>>
>> In my personal experience, approx 5 disks can max out approx 1 bus. (It
>> actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks
>> on a good bus, or good disks on a crap bus, but generally speaking people
>> don't do that. Generally people get a good bus for good disks, and cheap
>> disks for a crap bus, so approx 5 disks max out approx 1 bus.)
>>
>> In my personal experience, servers are generally built with a separate bus
>> for approx every 5-7 disk slots. So what it really comes down to is ...
>>
>> Instead of the Best Practices Guide saying "Don't put more than ___ disks
>> into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck
>> by constructing your vdevs using physical disks which are distributed
>> across multiple buses, as necessary per the speed of your disks and buses."
>
> This is assuming that you have no other IO besides the scrub.
>
> You should of course keep the number of disks in a vdev low for
> general performance reasons unless you only have linear reads (as your
> IOPS will be close to what only one disk can give for the whole vdev).

There is another optimization in the Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8. I.e. 2^n + P, where n is 1, 2, or 3 and P is the RAIDZ level.

I.e. optimal sizes:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev
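For reference, a few lines of Python regenerate that table from the "2^n data disks plus parity" rule quoted above; it is pure arithmetic and does not inspect an actual pool.

    # Enumerate the "2^n data disks + parity" vdev widths behind the table above.

    def optimal_widths(parity, exponents=(1, 2, 3)):
        return [2 ** n + parity for n in exponents]

    for parity, name in ((1, "RAIDZ1"), (2, "RAIDZ2"), (3, "RAIDZ3")):
        print(name, optimal_widths(parity))
    # RAIDZ1 [3, 5, 9]
    # RAIDZ2 [4, 6, 10]
    # RAIDZ3 [5, 7, 11]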
Erik Trimble wrote:
> On 9/9/2010 2:15 AM, taemun wrote:
>> Erik: does that mean that keeping the number of data drives in a
>> raidz(n) to a power of two is better? In the example you gave, you
>> mentioned 14kb being written to each drive. That doesn't sound very
>> efficient to me.
>>
>> (when I say the above, I mean a five disk raidz or a ten disk raidz2, etc)
>>
>> Cheers,
>
> Well, since the size of a slab can vary (from 512 bytes to 128k), it's
> hard to say. Length (size) of the slab is likely the better
> determination. Remember each block on a hard drive is 512 bytes (for
> now). So, it's really not any more efficient to write 16k than 14k
> (or vice versa). Both are integer multiples of 512 bytes.
>
> IIRC, there was something about using a power-of-two number of data
> drives in a RAIDZ, but I can't remember what that was. It may just be
> a phantom memory.

Not a phantom memory... From Matt Ahrens in a thread titled 'Metaslab alignment on RAID-Z':
http://www.opensolaris.org/jive/thread.jspa?messageID=60241

'To eliminate the blank "round up" sectors for power-of-two blocksizes of 8k or larger, you should use a power-of-two plus 1 number of disks in your raid-z group -- that is, 3, 5, or 9 disks (for double-parity, use a power-of-two plus 2 -- that is, 4, 6, or 10). Smaller blocksizes are more constrained; for 4k, use 3 or 5 disks (for double parity, use 4 or 6) and for 2k, use 3 disks (for double parity, use 4).'

These round-up sectors are skipped and used as padding to simplify space accounting and improve performance. I may have referred to them as zero-padding sectors in other posts, however they're not necessarily zeroed. See the thread titled 'raidz stripe size (not stripe width)':
http://opensolaris.org/jive/thread.jspa?messageID=495351

This looks to be the reasoning behind the optimization in the ZFS Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8. I.e. 2^n + P, where n is 1, 2, or 3 and P is the RAIDZ level.

I.e. optimal sizes:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

The Best Practices Guide recommendation of 3-9 devices per vdev appears to be based on RAIDZ1's optimal sizes of 3-9 devices when n = 1 to 3 in 2^n + P. Victor Latushkin said the same thing in a thread titled 'odd versus even'. Adam Leventhal said this had a 'very slight space-efficiency benefit' in the same thread.
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg05460.html

---

That said, the recommendations in the Best Practices Guide for RAIDZ2 to start with 5 disks and RAIDZ3 to start with 8 disks do not match the last recommendation. What is the reasoning behind 5 and 8? Reliability vs space?

Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2)
Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8

Perhaps the Best Practices Guide should also recommend:
- the use of striped vdevs in order to bring up the IOPS number, particularly when using enough hard drives to meet the capacity and reliability requirements
- avoiding slow consumer-class drives (fast ones may be okay for some users)
- more sample array configurations for common drive chassis capacities
- consider using a RAIDZ1 main pool with a RAIDZ1 backup pool rather than higher-level RAIDZ or mirroring (touch on the value of backup vs. stronger RAIDZ)
- watch out for BIOS or firmware upgrades that change host protected area (HPA) settings on drives, making them appear smaller than before

The BPG should also resolve this discrepancy:

Storage Pools section: "For production systems, use whole disks rather than slices for storage pools for the following reasons"

Additional Cautions for Storage Pools: "Consider planning ahead and reserving some space by creating a slice which is smaller than the whole disk instead of the whole disk."

---

Other (somewhat) related threads:

From Darren Dunham in a thread titled 'ZFS raidz2 number of disks':
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/dd1b5997bede5265

'> 1 Why is the recommendation for a raidz2 3-9 disks, what are the cons for having 16 in a pool compared to 8?

Reads potentially have to pull data from all data columns to reconstruct a filesystem block for verification. For random read workloads, increasing the number of columns in the raidz does not increase the read iops. So limiting the column count usually makes sense (with a cost tradeoff). 16 is valid, but not recommended.'

From Richard Elling in a thread titled 'rethinking RaidZ and Record size':
http://opensolaris.org/jive/thread.jspa?threadID=121016

'The raidz pathological worst case is a random read from a many-column raidz where files have records 128 KB in size. The inflated read problem is why it makes sense to match recordsize for fixed-record workloads. This includes CIFS workloads which use 4 KB records. It is also why having many columns in the raidz for large records does not improve performance. Hence the 3 to 9 raidz disk limit recommendation in the zpool man page.'

From Adam Leventhal in a thread titled 'Problem with RAID-Z in builds snv_120 - snv_123':
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg28907.html

'Basically, RAID-Z writes full stripes every time; note that without careful accounting it would be possible to effectively fragment the vdev such that single sectors were free but useless, since single-parity RAID-Z requires two adjacent sectors to store data (one for data, one for parity). To address this, RAID-Z rounds up its allocation to the next (nparity + 1). This ensures that all space is accounted for. RAID-Z will thus skip sectors that are unused based on this rounding. For example, under raidz1 a write of 1024 bytes would result in 512 bytes of parity, 512 bytes of data on two devices, and 512 bytes skipped.

To improve performance, ZFS aggregates multiple adjacent IOs into a single large IO. Further, hard drives themselves can perform aggregation of adjacent IOs. We noted that these skipped sectors were inhibiting performance, so we added "optional" IOs that could be used to improve aggregation. This yielded a significant performance boost for all RAID-Z configurations.'

From Adam Leventhal in a thread titled 'triple-parity: RAID-Z3':
http://opensolaris.org/jive/thread.jspa?threadID=108154

'> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either.
>
> Clarification?
I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks.'
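To make the round-up rule concrete, here is a simplified sketch of the allocation accounting described in those quotes. It is my own model, not the actual ZFS allocator code, but it reproduces Adam's 1024-byte raidz1 example and Matt Ahrens' 4k guidance above.

    import math

    SECTOR = 512  # bytes; assumes 512-byte sectors as in the quoted examples

    def raidz_sectors(write_bytes, width, nparity):
        # Simplified model: data sectors, plus one parity sector per row of
        # (width - nparity) data sectors, with the total rounded up to a
        # multiple of (nparity + 1); the remainder is skipped padding.
        data = math.ceil(write_bytes / SECTOR)
        parity = math.ceil(data / (width - nparity)) * nparity
        total = data + parity
        skipped = -total % (nparity + 1)
        return data, parity, skipped

    # Adam's raidz1 example: a 1024-byte write
    print(raidz_sectors(1024, width=5, nparity=1))   # (2, 1, 1) -> one skipped sector
    # Matt Ahrens' 4k guidance: a 5-disk raidz1 needs no round-up, a 4-disk does
    print(raidz_sectors(4096, width=5, nparity=1))   # (8, 2, 0)
    print(raidz_sectors(4096, width=4, nparity=1))   # (8, 3, 1)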
> From: Haudy Kazemi [mailto:kaze0010 at umn.edu]
>
> There is another optimization in the Best Practices Guide that says the
> number of devices in a vdev should be (N+P) with P = 1 (raidz), 2
> (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8.
> I.e. 2^n + P, where n is 1, 2, or 3 and P is the RAIDZ level.
>
> I.e. optimal sizes:
> RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
> RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
> RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

This sounds logical, although I don't know how real it is. The logic seems to be ... assuming slab sizes of 128K, the amount of data written to each disk within the vdev gets divided into something which is a multiple of 512b or 4K (newer drives supposedly starting to use 4K block sizes instead of 512b).

But I have doubts about the real-ness here, because ... an awful lot of the time your actual slabs are smaller than 128K, simply because you're not performing sustained sequential writes very often. But it seems to make sense that whenever you *do* have some sequential writes, you would want the data written to each disk to be a multiple of 512b or 4K. If you had a 128K slab divided across 5 data disks, then each disk would write 25.6K, and even for sustained sequential writes some degree of fragmentation would be impossible to avoid. Actually, I don't think fragmentation is technically the correct term for that behavior. It might be more appropriate to simply say it forces a less-than-100% duty cycle.

And another thing ... doesn't the checksum take up some space anyway? Even if you obeyed the BPG and used ... let's say ... 4 disks for N ... then each disk has 32K of data to write, which is a multiple of 4K and 512b ... but each disk also needs to write the checksum. So each disk writes 32K + a few bytes. Which defeats the whole purpose anyway, doesn't it?

The effect, if real at all, might be negligible. I don't know how small it is, but I'm quite certain it's not huge.
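A quick check of the per-disk arithmetic behind that alignment argument, as a minimal sketch. (Side note on the checksum question: ZFS stores block checksums in the parent block pointer rather than appended to the data, so they do not add "a few bytes" to each disk's share of the record.)

    # Per-disk share of a 128K record for a few data-disk counts, and whether
    # that share is a whole number of 512-byte or 4K sectors.

    RECORD = 128 * 1024

    for ndata in (4, 5, 8):
        per_disk = RECORD / ndata
        print(f"{ndata} data disks: {per_disk/1024:.1f} KiB/disk, "
              f"512B multiple: {per_disk % 512 == 0}, 4K multiple: {per_disk % 4096 == 0}")
    # 4 data disks: 32.0 KiB/disk, 512B multiple: True, 4K multiple: True
    # 5 data disks: 25.6 KiB/disk, 512B multiple: False, 4K multiple: False
    # 8 data disks: 16.0 KiB/disk, 512B multiple: True, 4K multiple: True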
Ahhh! So that's how the formula works. That makes perfect sense.

Let's take my case as a scenario: each of my vdevs is a 10-disk RaidZ2 (8 data + 2 parity). Using a 128K stripe, I'll have 128K/8 = 16K per data drive and 16K per parity drive. That fits both 512B and 4KB sectors.

It works in my favour that I'll have high average file sizes (>250MB), so I'll see minimal effect of the "fragmentation" mentioned.
-- This message posted from opensolaris.org
On Sep 9, 2010, at 6:39 AM, Marty Scholes wrote:
> Erik wrote:
>> Actually, your biggest bottleneck will be the IOPS limits of the
>> drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
>> IOPS, that means you will finish (best case) in 62.5e4 seconds. Which
>> is over 173 hours. Or, about 7.25 WEEKS.
>
> My OCD is coming out and I will split that hair with you. 173 hours is just over a week.
>
> This is a fascinating and timely discussion. My personal (biased and unhindered by facts) preference is wide-stripe RAIDZ3. Ned is right that I kept reading that RAIDZx should not exceed _ devices and couldn't find real numbers behind those conclusions.

There isn't a real number. We know that a 46-disk raidz stripe is a recipe for unhappiness (because people actually tried that when the thumper was released). And we know that a 2-disk raidz1 is kinda like mirroring -- a hard sell. So we had to find a number that was between the two, somewhere in the realm of reasonable.

> Discussions in this thread have opened my eyes a little and I am in the middle of deploying a second 22-disk fibre array on my home server, so I have been struggling with the best way to allocate pools.

Simple, mirror it and be happy :-).

> Up until reading this thread, the biggest downside to wide stripes that I was aware of has been low IOPS. And let's be clear: while on paper the IOPS of a wide stripe is the same as a single disk, it is actually worse. In truth, the service time for any request on a wide stripe is the service time of the SLOWEST disk for that request. The slowest disk may vary from request to request, but it will always delay the entire stripe operation.

Yes, but this is not a problem for async writes, so it will depend on the workload.

> Since all of the 44 spindles are 15K disks, I am about to convince myself to go with two pools of wide stripes and keep several spindles for L2ARC and SLOG. The thinking is that other background operations (scrub and resilver) can take place with little impact to application performance, since application I/O will be served from the L2ARC and SLOG.
>
> Of course, I could be wrong on any of the above.

If you get it wrong, you can reconfigure most things on the fly, except that you can't add columns to a raidz or shrink it. A good strategy is to start with what you need and add disks as capacity requires. Oh, and by the way, the easiest way to do that is with mirrors :-)

But if you insist on raidz, then consider something like 6-way or 8-way sets, because that is the typical denominator for most hardware trays today.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
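For a rough sense of the capacity trade-off behind that advice on a 22-slot tray, here is a small sketch. The layouts and the 1 TB per-disk figure are placeholders of mine for illustration, not recommendations from the thread.

    # Usable capacity for a few ways of carving up 22 equal disks.

    def usable_tb(disks_per_vdev, parity, vdevs, size_tb=1.0):
        return vdevs * (disks_per_vdev - parity) * size_tb

    print("11 x 2-way mirrors:", usable_tb(2, 1, 11), "TB usable")
    print("3 x 6-disk raidz2 (4 slots left for spares/L2ARC/SLOG):", usable_tb(6, 2, 3), "TB usable")
    print("2 x 11-disk raidz2 (wide stripes):", usable_tb(11, 2, 2), "TB usable")
    # -> 11.0 TB, 12.0 TB and 18.0 TB respectively: wider stripes buy capacity
    #    at the cost of per-vdev IOPS and rebuild behavior discussed earlier.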