for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
basis for this recommendation? i assume it is performance and not failure
resilience, but i am just guessing... [i know, recommendation was intended
for people who know their raid cold, so it needed no further explanation]

thanks... oz

--
ozan s. yigit | oz at somanetworks.com | 416 977 1414 x 1540
I have a hard time getting enough time to do even trivial blogging:
being truly thoughtful takes a lot of time. -- james gosling
Hello ozan,

Friday, November 3, 2006, 3:57:00 PM, you wrote:

osy> for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
osy> basis for this recommendation? i assume it is performance and not failure
osy> resilience, but i am just guessing... [i know, recommendation was intended
osy> for people who know their raid cold, so it needed no further explanation]

Performance reason for random reads.

ps. however, the bigger the raid-z group, the riskier it could be - but this
is obvious.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
ozan s. yigit wrote:
> for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
> basis for this recommendation? i assume it is performance and not failure
> resilience, but i am just guessing... [i know, recommendation was intended
> for people who know their raid cold, so it needed no further explanation]

Both actually.
The small, random read performance will approximate that of a single disk.
The probability of data loss increases as you add disks to a RAID-5/6/Z/Z2
volume.

For example, suppose you have 12 disks and insist on RAID-Z.
Given
   1. small, random read iops for a single disk is 141 (eg. 2.5" SAS
      10k rpm drive)
   2. MTBF = 1.4M hours (0.63% AFR) (so says the disk vendor)
   3. no spares
   4. service time = 24 hours, resync rate 100 GBytes/hr, 50% space
      utilization
   5. infinite service life

Scenario 1: 12-way RAID-Z
   performance = 141 iops
   MTTDL[1] = 68,530 years
   space = 11 * disk size

Scenario 2: 2x 6-way RAID-Z+0
   performance = 282 iops
   MTTDL[1] = 150,767 years
   space = 10 * disk size

[1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
 -- richard
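[A minimal sketch of the MTTDL[1] model above, in Python. The per-disk
capacity is an assumption -- 146 GBytes, a common 2.5" SAS 10k size of the
era, not stated in the givens -- chosen because it reproduces the quoted
figures to within rounding; everything else comes from the givens.]

    HOURS_PER_YEAR = 8760

    MTBF = 1.4e6          # hours, per the vendor claim
    SERVICE_TIME = 24.0   # hours before a human swaps the failed disk
    RESYNC_RATE = 100.0   # GBytes/hr
    DISK_SIZE = 146.0     # GBytes -- assumed, see note above
    UTILIZATION = 0.5     # 50% space utilization

    # MTTR = service response time + resync time
    MTTR = SERVICE_TIME + DISK_SIZE * UTILIZATION / RESYNC_RATE  # ~24.7 hours

    def mttdl_years(n_disks, n_groups=1):
        """MTTDL in years for n_groups striped n_disks-way RAID-Z groups.

        Striping multiplies exposure: losing any one group loses the
        pool, so the combined MTTDL divides by the number of groups.
        """
        per_group = MTBF ** 2 / (n_disks * (n_disks - 1) * MTTR)
        return per_group / n_groups / HOURS_PER_YEAR

    print(f"12-way RAID-Z:     {mttdl_years(12):,.0f} years")    # ~68,500
    print(f"2x 6-way RAID-Z+0: {mttdl_years(6, 2):,.0f} years")  # ~150,800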
On Fri, 3 Nov 2006, Richard Elling - PAE wrote:

> ozan s. yigit wrote:
> > for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
> > basis for this recommendation? i assume it is performance and not failure
> > resilience, but i am just guessing...
>
> Both actually.
> The small, random read performance will approximate that of a single disk.
> The probability of data loss increases as you add disks to a RAID-5/6/Z/Z2
> volume.
> [...]
> [1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)

But ... I'm not sure I buy into your numbers given the probability that
more than one disk will fail inside the service window - given that the
disks are identical? Or ... a disk failure occurs at 5:01 PM (quitting
time) on a Friday and won't be replaced until 8:00 AM on Monday morning.
Does the failure data you have access to support my hypothesis that
failures of identical mechanical systems tend to occur in small clusters
within a relatively small window of time?

Call me paranoid, but I'd prefer to see a product like thumper configured
with 50% of the disks manufactured by vendor A and the other 50%
manufactured by someone else.

This paranoia is based on a personal experience, many years ago (before we
had smart fans etc), where we had a rack full of expensive custom
equipment cooled by (what we thought was) a highly redundant group of 5
fans. One fan suffered infant mortality and its failure went unnoticed,
leaving 4 fans running. Two of the fans died on the same extended weekend
(public holiday). It was an expensive and embarrassing disaster.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Al Hopper wrote:
>> [1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
>
> But ... I'm not sure I buy into your numbers given the probability that
> more than one disk will fail inside the service window - given that the
> disks are identical? Or ... a disk failure occurs at 5:01 PM (quitting
> time) on a Friday and won't be replaced until 8:00 AM on Monday morning.
> Does the failure data you have access to support my hypothesis that
> failures of identical mechanical systems tend to occur in small clusters
> within a relatively small window of time?

Separating the right hand side:
	MTTDL = MTBF/N * MTBF/((N-1) * MTTR)
the right-most factor accounts for the probability that one of the N-1
remaining disks fails during the recovery window for the first disk's
failure. As the MTTR increases, the probability of the 2nd disk failure
also increases. RAIDoptimizer calculates the MTTR as:
	MTTR = service response time + resync time
where
	resync time = size * space used (%) / resync rate

Incidentally, since ZFS schedules the resync iops itself, it can
really move along on a mostly idle system. You should be able to resync
at near the media speed for an idle system. By contrast, a hardware RAID
array has no knowledge of the context of the data or the I/O scheduling,
so they will perform resyncs using a throttle. Not only do they end up
resyncing unused space, but they also take a long time (4-18 GBytes/hr for
some arrays) and thus expose you to a higher probability of second disk
failure.

> Call me paranoid, but I'd prefer to see a product like thumper configured
> with 50% of the disks manufactured by vendor A and the other 50%
> manufactured by someone else.

Diversity is usually a good thing. Unfortunately, this is often
impractical for a manufacturer.

> This paranoia is based on a personal experience, many years ago (before we
> had smart fans etc), where we had a rack full of expensive custom
> equipment cooled by (what we thought was) a highly redundant group of 5
> fans. One fan suffered infant mortality and its failure went unnoticed,
> leaving 4 fans running. Two of the fans died on the same extended weekend
> (public holiday). It was an expensive and embarrassing disaster.

Modelling such as this assumes independence of failures. Common cause or
bad lots are not that hard to model, but you may never find any failure
rate data for them. You can look at the MTBF sensitivities, though that is
an opening to another set of results. I prefer to ignore the absolute
values and judge competing designs by their relative results. To wit, I
fully expect to be beyond dust in 150,767 years, and the expected lifetime
of most disks is 5 years. But given two competing designs using the same
model, a design predicting an MTTDL of 150,767 years will very likely
demonstrate better MTTDL than a design predicting 68,530 years.
 -- richard
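[The resync-rate point can be made concrete with the same model. A sketch,
reusing the assumed 146 GByte disk from the earlier note; the further
assumption here is that a hardware array resyncs the whole disk (it cannot
know which space is unused), while ZFS resyncs only the allocated half.]

    MTBF, SERVICE_TIME = 1.4e6, 24.0     # hours
    DISK_SIZE, UTILIZATION = 146.0, 0.5  # GBytes, fraction in use

    def mttdl_years(resync_rate, whole_disk, n=12):
        # A slower resync lengthens MTTR, which directly lowers MTTDL.
        data = DISK_SIZE if whole_disk else DISK_SIZE * UTILIZATION
        mttr = SERVICE_TIME + data / resync_rate          # hours
        return MTBF ** 2 / (n * (n - 1) * mttr) / 8760    # years

    print(mttdl_years(4, whole_disk=True))     # array at 4 GB/hr:  ~28,000 yrs
    print(mttdl_years(18, whole_disk=True))    # array at 18 GB/hr: ~52,800 yrs
    print(mttdl_years(100, whole_disk=False))  # ZFS, used space:   ~68,500 yrs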
Richard Elling - PAE wrote:
> Incidentally, since ZFS schedules the resync iops itself, it can
> really move along on a mostly idle system. [...] Not only do they end up
> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
> some arrays) and thus expose you to a higher probability of second disk
> failure.

Just as another data point: It is true that the array doesn't know the
context of the data or the i/o scheduling, but some arrays do watch the
incoming data rate and throttle accordingly. (T3 used to, for example.)
Hello Richard,

Saturday, November 4, 2006, 12:46:05 AM, you wrote:

REP> Incidentally, since ZFS schedules the resync iops itself, it can
REP> really move along on a mostly idle system. [...] Not only do they end up
REP> resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP> some arrays) and thus expose you to a higher probability of second disk
REP> failure.

However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where the server does many other things and
not only file serving - scrub/resilver can take much CPU power on
slower servers.

Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Robert Milkowski wrote:
> However some mechanism to slow or freeze scrub/resilvering would be
> useful. Especially in cases where the server does many other things and
> not only file serving - scrub/resilver can take much CPU power on
> slower servers.
>
> Something like 'zpool scrub -r 10 pool' - which would mean 10% of
> speed.

I think this has some merit for scrubs, but I wouldn't suggest it for
resilver. If your data is at risk, there is nothing more important than
protecting it. While that sounds harsh, in reality there is a practical
limit determined by the ability of a single LUN to absorb a (large,
sequential?) write workload. For JBODs, that would be approximately the
media speed.

The big question, though, is "10% of what?"  User CPU?  iops?
 -- richard
Richard Elling - PAE schrieb:
> The big question, though, is "10% of what?" User CPU? iops?

Maybe something like the "slow" parameter of VxVM?

     slow[=iodelay]
            Reduces the system performance impact of copy
            operations. Such operations are usually per-
            formed on small regions of the volume (nor-
            mally from 16 kilobytes to 128 kilobytes).
            This option inserts a delay between the
            recovery of each such region. A specific
            delay can be specified with iodelay as a
            number of milliseconds; otherwise, a default
            is chosen (normally 250 milliseconds).

Daniel
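[To make the mechanism concrete, here is a minimal sketch of this style of
throttle in Python -- copy a small region, then sleep. It illustrates the
VxVM slow=iodelay idea only; it is not ZFS or VxVM code, and the region
size, delay, and paths are all assumptions.]

    import time

    REGION = 128 * 1024   # bytes per region (VxVM copies 16-128 KB regions)
    IODELAY = 0.250       # seconds between regions (the VxVM default)

    def throttled_copy(src_path, dst_path, region=REGION, iodelay=IODELAY):
        """Copy one region at a time, sleeping between regions.

        The sleep caps throughput at roughly region/iodelay bytes/sec
        (here 128 KB / 0.25 s = 512 KB/s), leaving the rest of the
        device bandwidth for foreground I/O.
        """
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(region):
                dst.write(chunk)
                time.sleep(iodelay)

    # throttled_copy("good_half.img", "new_disk.img")  # hypothetical paths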
Daniel Rock wrote:
> Richard Elling - PAE schrieb:
>> The big question, though, is "10% of what?" User CPU? iops?
>
> Maybe something like the "slow" parameter of VxVM?
>
>      slow[=iodelay]
>             Reduces the system performance impact of copy
>             operations. [...] This option inserts a delay between
>             the recovery of each such region. A specific delay can
>             be specified with iodelay as a number of milliseconds;
>             otherwise, a default is chosen (normally 250 milliseconds).

For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?

NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?  In the bad old
days when disks were small, and the systems were slow, this made some
sense. The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
 -- richard
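[The 11-day figure checks out arithmetically, using decimal kBytes, which
the estimate evidently assumes:]

    rate = 4 * 128_000        # 4 regions/sec * 128 kBytes = 512,000 bytes/sec
    seconds = 500e9 / rate    # one 500 GByte drive
    print(seconds / 86400)    # ~11.3 days, i.e. about 11 days and 8 hours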
Richard Elling - PAE wrote:
> The better approach is for the file system to do what it needs
> to do as efficiently as possible, which is the current state of ZFS.

This implies that the filesystem has exclusive use of the channel - SAN
or otherwise - as well as the storage array front end controllers, cache,
and the raid groups that may be behind it. What we really need in this
case, and a few others, is the filesystem and backend storage working
together ... but I'll save that rant for another day. ;)
Richard Elling - PAE schrieb:
> For modern machines, which *should* be the design point, the channel
> bandwidth is underutilized, so why not use it?

And what about encrypted disks? Simply create a zpool with
checksum=sha256, fill it up, then scrub. I'd be happy if I could use my
machine during scrubbing. A throttling of scrubbing would help. Maybe
also running the scrubbing with a "high nice level" in kernel.

> NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
> to resilver a single 500 GByte drive -- feeling lucky?

250ms is the Veritas default. It doesn't have to be the ZFS default also.

Daniel
Hello Richard,

Tuesday, November 7, 2006, 5:19:07 PM, you wrote:

REP> I think this has some merit for scrubs, but I wouldn't suggest it for
REP> resilver. If your data is at risk, there is nothing more important than
REP> protecting it. While that sounds harsh, in reality there is a practical
REP> limit determined by the ability of a single LUN to absorb a (large,
REP> sequential?) write workload. For JBODs, that would be approximately the
REP> media speed.

I can't agree. I have some performance sensitive environments and I
know that during the day I do not want to lose performance even if it
means longer resilvering times. That's exactly what I do on a HW RAID
controller. In other environments I do want to resilver ASAP, you're
right.

Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.

REP> The big question, though, is "10% of what?"  User CPU?  iops?

Just slow the rate at which resilvering/scrub is done.
Insert some kind of delay, as someone else suggested.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
On Thu, 9 Nov 2006, Robert Milkowski wrote:

> I can't agree. I have some performance sensitive environments and I
> know that during the day I do not want to lose performance even if it
> means longer resilvering times. That's exactly what I do on a HW RAID
> controller. In other environments I do want to resilver ASAP, you're
> right.
>
> Also scrub can consume all CPU power on smaller and older machines and
> that's not always what I would like.
>
> REP> The big question, though, is "10% of what?"  User CPU?  iops?

Probably N% of I/O Ops/Second would work well.

> Just slow the rate at which resilvering/scrub is done.
> Insert some kind of delay, as someone else suggested.

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Hello Al,

Friday, November 10, 2006, 2:21:38 PM, you wrote:

AH> Probably N% of I/O Ops/Second would work well.

Or if 100% means full speed, then 10% means that the expected time should
be approximately 10x longer (instead of 1h, make it 10h).

It would be more intuitive than specifying some numbers like IOPS, etc.

Additionally, setting it to 0 means freeze; then setting it above 0% means
continue (continue, not start from the beginning).

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
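[One way to realize this percentage semantics is to translate the setting
into an inter-batch delay. A sketch; the batch-timing framing is an
illustrative assumption, not how ZFS implements anything:]

    def scrub_delay(batch_seconds, percent):
        """Sleep to insert after each scrub batch so progress runs at
        ~percent of full speed: total time per batch becomes
        batch_seconds * 100 / percent. percent == 0 means freeze."""
        if percent <= 0:
            return float("inf")   # frozen; caller parks until set above 0%
        return batch_seconds * (100.0 / percent - 1.0)

    # A batch that takes 50 ms at full speed, throttled to 10%:
    print(scrub_delay(0.050, 10))  # 0.45 s sleep -> ~10x the elapsed time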
Robert Milkowski wrote:
> AH> Probably N% of I/O Ops/Second would work well.
>
> Or if 100% means full speed, then 10% means that the expected time should
> be approximately 10x longer (instead of 1h, make it 10h).
>
> It would be more intuitive than specifying some numbers like IOPS, etc.

In any case you're still going to have to provide a tunable for this
even if the resulting algorithm works well on the host side. Keep in
mind that a scrub can also impact the array(s) your filesystem lives
on. If all my ZFS systems started scrubbing at full speed - because they
thought they weren't busy - at the same time, it might cause issues with
other I/O on the array itself.
Hello Torrey,

Friday, November 10, 2006, 11:31:31 PM, you wrote:

TM> In any case you're still going to have to provide a tunable for this
TM> even if the resulting algorithm works well on the host side. Keep in
TM> mind that a scrub can also impact the array(s) your filesystem lives
TM> on. If all my ZFS systems started scrubbing at full speed - because they
TM> thought they weren't busy - at the same time, it might cause issues with
TM> other I/O on the array itself.

Tunable in a form of pool property, with default 100%.

On the other hand maybe the simple algorithm Veritas has used is good
enough - a simple delay between scrubbing/resilvering some data.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Robert Milkowski wrote:
> Tunable in a form of pool property, with default 100%.
>
> On the other hand maybe the simple algorithm Veritas has used is good
> enough - a simple delay between scrubbing/resilvering some data.

I think a not-too-convoluted algorithm as people have suggested would be
ideal and then let people override it as necessary. I would think a 100%
default might be a call generator but I'm up for debate. ("Hey my array
just went crazy. All the lights are blinking but my application isn't
doing any I/O. What gives?")
Hello Torrey,

Monday, November 13, 2006, 5:07:02 AM, you wrote:

TM> I think a not-too-convoluted algorithm as people have suggested would be
TM> ideal and then let people override it as necessary. I would think a 100%
TM> default might be a call generator but I'm up for debate. ("Hey my array
TM> just went crazy. All the lights are blinking but my application isn't
TM> doing any I/O. What gives?")

You've got the same behavior with any LVM when you replace a disk.
So it's not something unexpected for admins. Also, most of the time
they expect the LVM to resilver ASAP. With the default setting not being
100% you'll definitely see people complaining ZFS is slooow, etc.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Howdy Robert.

Robert Milkowski wrote:
> You've got the same behavior with any LVM when you replace a disk.
> So it's not something unexpected for admins. Also, most of the time
> they expect the LVM to resilver ASAP. With the default setting not being
> 100% you'll definitely see people complaining ZFS is slooow, etc.

It's quite possible that I've only seen the other side of the coin, but in
my past I've had support calls where customers complained that they
{replaced a drive, resilvered a mirror, ... } and it knocked the
performance of other things. My fave was a set of A5200s on a hub: after
they cranked the i/o rate up on the mirror it caused some other app -
methinks it was Oracle - to get too slow, think there was a disk problem,
crash(!), and then initiate a cluster failover. Given the disk group was
not in perfect health ... oh the fun we had.

In any case the key is documenting the behavior well enough so people can
see what is going on, how to tune it slower or faster on the fly, etc. I'm
more concerned with that than the actual algorithm or method used.
>> Maybe something like the "slow" parameter of VxVM?
>>
>>      slow[=iodelay]
>>             Reduces the system performance impact of copy
>>             operations. [...] This option inserts a delay between
>>             the recovery of each such region. A specific delay can
>>             be specified with iodelay as a number of milliseconds;
>>             otherwise, a default is chosen (normally 250 milliseconds).
>
> For modern machines, which *should* be the design point, the channel
> bandwidth is underutilized, so why not use it?
>
> NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
> to resilver a single 500 GByte drive -- feeling lucky?  In the bad old
> days when disks were small, and the systems were slow, this made some
> sense. The better approach is for the file system to do what it needs
> to do as efficiently as possible, which is the current state of ZFS.

Well, we are trying to balance the impact of resilvering on running
applications with the speed of resilvering.

I think that having an option to tell the filesystem to postpone
full-throttle resilvering till some quieter period of time may help.
This may be combined with some throttling mechanism so that during a
quieter period resilvering is done at full speed, and during a busy
period it may continue at reduced speed. Such an arrangement may be
useful for customers with e.g. well-defined SLAs.

Wbr,
Victor
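[Victor's quiet-period idea could pair with a load-adaptive throttle. A
sketch; the iops threshold and the delay values are illustrative
assumptions:]

    BUSY_IOPS = 500.0   # assumed: foreground load above this counts as busy

    def adaptive_delay(foreground_iops, max_delay=0.25):
        """Scale the inter-batch delay with observed foreground load:
        idle -> zero delay (full-speed resilver); at or above BUSY_IOPS
        -> back off to max_delay seconds between batches."""
        load = min(foreground_iops / BUSY_IOPS, 1.0)
        return load * max_delay

    for iops in (0, 100, 500, 2000):   # idle through heavily loaded
        print(f"{iops:4d} iops -> {adaptive_delay(iops):.3f} s between batches")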
Hi All,
From reading the docs, it seems that you can add devices (non-spares)
to a zpool, but you cannot take them away, right?
Best,
Mike

Victor Latushkin wrote:
> Well, we are trying to balance the impact of resilvering on running
> applications with the speed of resilvering.
>
> I think that having an option to tell the filesystem to postpone
> full-throttle resilvering till some quieter period of time may help.
> [...]
> Wbr,
> Victor
Hi Mike,

Yes, outside of the hot-spares feature, you can detach, offline, and
replace existing devices in a pool, but you can't remove devices, yet.

This feature work is being tracked under this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783

Cindy

Mike Seda wrote:
> Hi All,
> From reading the docs, it seems that you can add devices (non-spares)
> to a zpool, but you cannot take them away, right?
> Best,
> Mike
> [...]
Torrey McMahon wrote:
> Robert Milkowski wrote:
>> Tunable in a form of pool property, with default 100%.
>>
>> On the other hand maybe the simple algorithm Veritas has used is good
>> enough - a simple delay between scrubbing/resilvering some data.
>
> I think a not-too-convoluted algorithm as people have suggested would
> be ideal and then let people override it as necessary. I would think a
> 100% default might be a call generator but I'm up for debate. ("Hey my
> array just went crazy. All the lights are blinking but my application
> isn't doing any I/O. What gives?")

I'll argue that *any* random % is bogus. What you really want to
do is prioritize activity where resources are constrained. From a RAS
perspective, idle systems are the devil's playground :-). ZFS already
does prioritize I/O that it knows about. Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.
 -- richard
Richard Elling - PAE wrote:
> I'll argue that *any* random % is bogus. What you really want to
> do is prioritize activity where resources are constrained. From a RAS
> perspective, idle systems are the devil's playground :-). ZFS already
> does prioritize I/O that it knows about. Prioritizing on CPU might have
> some merit, but to integrate into Solaris' resource management system
> might bring some added system admin complexity which is unwanted.

I agree, but the problem as I see it is that nothing has an overview of
the entire environment. ZFS knows what I/O is coming in and what it's
sending out, but that's it. Even if we had an easy to use resource
management framework across all the Sun applications and devices, we'd
still run into non-Sun bits that place demands on shared components
like networking, san, arrays, etc. Anything that can be auto-tuned is
great but I'm afraid we're still going to need manual tuning in some
cases.
Torrey McMahon wrote:
> I agree, but the problem as I see it is that nothing has an overview of
> the entire environment. ZFS knows what I/O is coming in and what it's
> sending out, but that's it. Even if we had an easy to use resource
> management framework across all the Sun applications and devices, we'd
> still run into non-Sun bits that place demands on shared components
> like networking, san, arrays, etc. Anything that can be auto-tuned is
> great but I'm afraid we're still going to need manual tuning in some
> cases.

I think this is reason #7429 why I hate SANs: no meaningful QoS
related to reason #85823 why I hate SANs: sdd_max_throttle is a butt-ugly
hack :-)
 -- richard
I noticed that there is still an open bug regarding removing devices
from a zpool:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Does anyone know if or when this feature will be implemented?

Cindy Swearingen wrote:
> Hi Mike,
>
> Yes, outside of the hot-spares feature, you can detach, offline, and
> replace existing devices in a pool, but you can't remove devices, yet.
>
> This feature work is being tracked under this RFE:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
>
> Cindy
> [...]
Mike,

This RFE is still being worked and I have no ETA on completion...

cs

Mike Seda wrote:
> I noticed that there is still an open bug regarding removing devices
> from a zpool:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
> Does anyone know if or when this feature will be implemented?
> [...]