Jim Klimov
2012-Jan-06 23:33 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Hello all, I have a new idea up for discussion.

Several RAID systems have implemented "spread" spare drives in the sense that there is not an idling disk waiting to receive a burst of resilver data filling it up; instead, the capacity of the spare disk is spread among all drives in the array. As a result, the healthy array gets one more spindle and works a little faster, and rebuild times are often decreased since more spindles can participate in repairs at the same time.

I don't think I've seen such an idea proposed for ZFS, and I do wonder if it is at all possible with variable-width stripes? Although if the disk is sliced into 200 metaslabs or so, implementing a spread spare is a no-brainer as well.

To be honest, I've seen this a long time ago in (Falcon?) RAID controllers, and recently in a USENIX presentation of IBM GPFS on YouTube. In the latter the speaker goes into greater depth describing their "declustered RAID" approach (as they call it): all blocks - spare, redundancy and data - are intermixed evenly on all drives and not in a single "group" or a mid-level VDEV as they would be for ZFS.

http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related

GPFS with declustered RAID not only decreases rebuild times and/or the impact of rebuilds on end-user operations, but it also happens to increase reliability - in case of a multiple-disk failure in a large RAID-6 or RAID-7 array (in the example they use 47-disk sets) there is a smaller time window during which the data is left in a "critical state" due to lack of redundancy, and there is less data overall in such a state - so the system goes from critical to simply degraded (with some redundancy) in a few minutes.

Another thing they have in GPFS is temporary offlining of disks so that they can catch up when reattached - only newer writes (bigger TXG numbers in ZFS terms) are added to reinserted disks. I am not sure this exists in ZFS today, either. This might simplify physical systems maintenance (as it does for IBM boxes - see the presentation if interested) and quick recovery from temporarily unavailable disks, such as when a disk gets a bus reset and is unavailable for writes for a few seconds (or more) while the array keeps on writing.

I find these ideas cool. I do believe that IBM might get angry if ZFS development copy-pasted them "as is", but it might nonetheless get us inventing a similar wheel that would be a bit different ;) There are already several vendors doing this in some way, so perhaps there is no (patent) monopoly in place already...

And I think all the magic of spread spares and/or "declustered RAID" would go into just making another write-block allocator in the same league as "raidz" or "mirror" are nowadays... BTW, are such allocators pluggable (as software modules)?

What do you think - can and should such ideas find their way into ZFS? Or why not? Perhaps from theoretical or real-life experience with such storage approaches?

//Jim Klimov
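
P.S. To make the layout idea a bit more concrete, here is a toy sketch in Python - not the GPFS algorithm and nothing like the real ZFS allocator; disk count and stripe geometry are invented - showing how data, parity and spare slots can be rotated across all disks so that the spare capacity ends up spread evenly:

# Toy declustered layout: each stripe places its data, parity and spare
# slots starting at a different disk, so "spare" space is spread across
# the whole set instead of sitting on one idle drive.
from collections import Counter

NDISKS = 7                        # 7 physical disks
DATA, PARITY, SPARE = 4, 2, 1     # slots per stripe: a 4+2 stripe plus 1 spare slot

def stripe_layout(stripe_no):
    """Map disk -> role for one stripe, rotating the starting disk."""
    roles = ["data"] * DATA + ["parity"] * PARITY + ["spare"] * SPARE
    start = stripe_no % NDISKS
    return {(start + i) % NDISKS: role for i, role in enumerate(roles)}

usage = [Counter() for _ in range(NDISKS)]
for s in range(7000):             # many stripes
    for disk, role in stripe_layout(s).items():
        usage[disk][role] += 1

for d, c in enumerate(usage):
    print(f"disk {d}: data={c['data']} parity={c['parity']} spare={c['spare']}")

Every disk ends up carrying the same share of data, parity and spare slots, so a rebuild reads from and writes to all surviving disks instead of funneling everything into one idle spindle.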
Bob Friesenhahn
2012-Jan-08 00:28 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sat, 7 Jan 2012, Jim Klimov wrote:

> Several RAID systems have implemented "spread" spare drives
> in the sense that there is not an idling disk waiting to
> receive a burst of resilver data filling it up, but the
> capacity of the spare disk is spread among all drives in
> the array. As a result, the healthy array gets one more
> spindle and works a little faster, and rebuild times are
> often decreased since more spindles can participate in
> repairs at the same time.

I think that I would also be interested in a system which uses the so-called spare disks for more protective redundancy, but then reduces that protective redundancy in order to use that disk to replace a failed disk or to automatically enlarge the pool.

For example, a pool could start out with four-way mirroring when there is little data in the pool. When the pool becomes more full, mirror devices are automatically removed (from existing vdevs) and used to add more vdevs. Eventually a limit would be hit so that no more mirrors are allowed to be removed.

Obviously this approach works with simple mirrors but not for raidz.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
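
To put rough numbers on the tradeoff Bob describes (a toy Python sketch; the 12 x 2 TB shelf is invented and no real pools are involved):

# Toy capacity/redundancy tradeoff for the auto-shrinking-mirror idea above.
# 12 disks of 2 TB, regrouped from 4-way down to 2-way mirrors.
DISKS, DISK_TB = 12, 2.0

for way in (4, 3, 2):
    vdevs = DISKS // way
    usable_tb = vdevs * DISK_TB
    print(f"{way}-way mirrors: {vdevs} vdevs, {usable_tb:.0f} TB usable, "
          f"each vdev survives {way - 1} disk failures")

Each step down in mirror width trades one disk failure of protection per vdev for another vdev's worth of usable space.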
Richard Elling
2012-Jan-08 01:37 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Hi Jim,

On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:

> Hello all,
>
> I have a new idea up for discussion.
>
> Several RAID systems have implemented "spread" spare drives
> in the sense that there is not an idling disk waiting to
> receive a burst of resilver data filling it up, but the
> capacity of the spare disk is spread among all drives in
> the array. As a result, the healthy array gets one more
> spindle and works a little faster, and rebuild times are
> often decreased since more spindles can participate in
> repairs at the same time.

Xiotech has a distributed, relocatable model, but the FRU is the whole ISE. There have been other implementations of more distributed RAIDness in the past (RAID-1E, etc).

The big question is whether they are worth the effort. Spares solve a serviceability problem and only impact availability in an indirect manner. For single-parity solutions, spares can make a big difference in MTTDL, but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).

> I don't think I've seen such an idea proposed for ZFS, and
> I do wonder if it is at all possible with variable-width
> stripes? Although if the disk is sliced into 200 metaslabs
> or so, implementing a spread spare is a no-brainer as well.

Put some thoughts down on paper and work through the math. If it all works out, let's implement it!
 -- richard

> ...

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Jim Klimov
2012-Jan-08 02:59 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
2012-01-08 5:37, Richard Elling wrote:

> The big question is whether they are worth the effort. Spares solve a serviceability
> problem and only impact availability in an indirect manner. For single-parity
> solutions, spares can make a big difference in MTTDL, but have almost no impact
> on MTTDL for double-parity solutions (eg. raidz2).

Well, regarding this part: in the presentation linked in my OP, the IBM presenter suggests that for a 6-disk raid10 (3 mirrors) with one spare drive - overall a 7-disk set - there are these options for "critical" hits to data redundancy when one of the drives dies:

1) Traditional RAID - one full disk is a mirror of another full disk; 100% of a disk's size is "critical" and has to be replicated onto a spare drive ASAP;

2) Declustered RAID - all 7 disks are used for 2 unique data blocks from the "original" setup and one spare block (I am not sure I described it well in words, his diagram shows it better); if a single disk dies, only 1/7 worth of a disk's size is critical (not redundant) and can be fixed faster.

For their typical 47-disk sets of RAID-7-like redundancy, under 1% of the data becomes critical when 3 disks die at once, which is (deemed) unlikely as is.

Apparently, in the GPFS layout, MTTDL is much higher than in raid10+spare with all other stats being similar.

I am not sure I'm ready (or qualified) to sit down and present the math right now - I just heard some ideas that I considered worth sharing and discussing ;)

Thanks for the input,
//Jim
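
P.S. For what it's worth, a back-of-the-envelope version of that comparison in Python. The 1/7 figure is the presenter's; the 100 MB/s rebuild-write rate, the 2 TB disk size and the assumption that all surviving disks share the declustered rebuild are mine, purely for illustration:

# Rough rebuild-exposure comparison for a 7-disk set (6-disk raid10 + spare
# vs. the same capacity declustered over all 7 disks).
disk_tb = 2.0
n_disks = 7
rebuild_write_mb_s = 100.0        # what one disk can absorb while rebuilding

# Traditional: a whole disk's worth of data is critical and must all be
# rewritten onto the single dedicated spare.
trad_critical_tb = disk_tb
trad_hours = trad_critical_tb * 1e6 / rebuild_write_mb_s / 3600

# Declustered: only ~1/7 of a disk's worth is critical, and the repair
# writes are shared by the 6 surviving disks.
decl_critical_tb = disk_tb / n_disks
decl_hours = decl_critical_tb * 1e6 / (rebuild_write_mb_s * (n_disks - 1)) / 3600

print(f"critical data : {trad_critical_tb:.2f} TB vs {decl_critical_tb:.2f} TB")
print(f"time exposed  : {trad_hours:.1f} h  vs {decl_hours:.2f} h")

So with these made-up rates the window without redundancy shrinks from hours to minutes, which matches the "critical to simply degraded in a few minutes" claim from the talk.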
Tim Cook
2012-Jan-08 03:15 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.elling at gmail.com> wrote:

> Hi Jim,
>
> On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
>
>> Several RAID systems have implemented "spread" spare drives
>> in the sense that there is not an idling disk waiting to
>> receive a burst of resilver data filling it up, but the
>> capacity of the spare disk is spread among all drives in
>> the array. As a result, the healthy array gets one more
>> spindle and works a little faster, and rebuild times are
>> often decreased since more spindles can participate in
>> repairs at the same time.
>
> Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
> There have been other implementations of more distributed RAIDness in the
> past (RAID-1E, etc).
>
> The big question is whether they are worth the effort. Spares solve a
> serviceability problem and only impact availability in an indirect manner.
> For single-parity solutions, spares can make a big difference in MTTDL,
> but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).

I disagree. Dedicated spares impact far more than availability. During a rebuild, performance is, in general, abysmal. ZIL and L2ARC will obviously help (L2ARC more than ZIL), but at the end of the day, if we've got a 12-hour rebuild (fairly conservative in the days of 2TB SATA drives), the performance degradation is going to be very real for end-users. With distributed parity and spares, you should in theory be able to cut this down an order of magnitude. I feel as though you're brushing this off as not a big deal when it's an EXTREMELY big deal (in my mind).

In my opinion you can't just approach this from an MTTDL perspective, you also need to take into account user experience. Just because I haven't lost data doesn't mean the system isn't (essentially) unavailable (sorry for the double negative and repeated parentheses). If I can't use the system due to performance being a fraction of what it is during normal production, it might as well be an outage.

>> I don't think I've seen such an idea proposed for ZFS, and
>> I do wonder if it is at all possible with variable-width
>> stripes? Although if the disk is sliced into 200 metaslabs
>> or so, implementing a spread spare is a no-brainer as well.
>
> Put some thoughts down on paper and work through the math. If it all works
> out, let's implement it!
> -- richard

I realize it's not intentional Richard, but that response is more than a bit condescending. If he could just put it down on paper and code something up, I strongly doubt he would be posting his thoughts here. He would be posting results. The intention of his post, as far as I can tell, is to perhaps inspire someone who CAN just write down the math and write up the code to do so. Or at least to have them review his thoughts and give him a dev's perspective on how viable bringing something like this to ZFS is. I fear responses like "the code is there, figure it out" make the *aris community no better than the linux one.

> ...
As always, feel free to tell me why my rant is completely off base ;)

--Tim
Jim Klimov
2012-Jan-08 14:29 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
First of all, I would like to thank Bob, Richard and Tim for at least taking the time to look at this proposal and responding ;)

It is also encouraging to see that 2 of the 3 responders consider this idea at least worth pondering and discussing, as it appeals to their direct interest. Even Richard was not dismissive of it ;)

Finally, as Tim was right to note, I am not a kernel developer (and won't become one as good as those present on this list). Of course, I could "pull the blanket onto my side" and say that I'd try to write that code myself... but it would probably be a long wait, like that for "BP rewrite" - because I already have quite a few commitments and responsibilities as an admin and recently as a parent (yay!)

So, I guess, my piece of the pie is currently limited to RFEs and bug reports... and working in IT for a software development company, I believe (or hope) that's not a useless part of the process ;)

I do believe that ZFS technology is amazing - despite some shortcomings that are still present - and I do want to see it flourish... ASAP! :^)

//Jim

2012-01-08 7:15, Tim Cook wrote:
> On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.elling at gmail.com
> <mailto:richard.elling at gmail.com>> wrote:
> ...
>
> I disagree. Dedicated spares impact far more than availability. During
> a rebuild, performance is, in general, abysmal. ...
> If I can't use the system due to performance being a fraction of what
> it is during normal production, it might as well be an outage.
> ...
> I realize it's not intentional Richard, but that response is more than a
> bit condescending. If he could just put it down on paper and code
> something up, I strongly doubt he would be posting his thoughts here.
> He would be posting results. ...
>
> As always, feel free to tell me why my rant is completely off base ;)
>
> --Tim
Pasi Kärkkäinen
2012-Jan-08 20:12 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, Jan 08, 2012 at 06:59:57AM +0400, Jim Klimov wrote:

> 2012-01-08 5:37, Richard Elling wrote:
>> The big question is whether they are worth the effort. ...
>
> Well, regarding this part: in the presentation linked in my OP,
> the IBM presenter suggests that for a 6-disk raid10 (3 mirrors)
> with one spare drive - overall a 7-disk set - there are these
> options for "critical" hits to data redundancy when one of the
> drives dies:
> ...
> Apparently, in the GPFS layout, MTTDL is much higher than
> in raid10+spare with all other stats being similar.
>
> I am not sure I'm ready (or qualified) to sit down and present
> the math right now - I just heard some ideas that I considered
> worth sharing and discussing ;)

Thanks for the video link (http://www.youtube.com/watch?v=2g5rx4gP6yU). It's very interesting!

GPFS Native RAID seems to be more advanced than current ZFS, and it even has rebalancing implemented (the infamous missing zfs bp-rewrite). It'd definitely be interesting to have something like this implemented in ZFS.

-- Pasi
Richard Elling
2012-Jan-09 02:25 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Note: more analysis of the GPFS implementations is needed, but that will take more time than I'll spend this evening :-) Quick hits below...

On Jan 7, 2012, at 7:15 PM, Tim Cook wrote:

> On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.elling at gmail.com> wrote:
>> The big question is whether they are worth the effort. Spares solve a
>> serviceability problem and only impact availability in an indirect manner.
>> For single-parity solutions, spares can make a big difference in MTTDL,
>> but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).
>
> I disagree. Dedicated spares impact far more than availability. During a
> rebuild, performance is, in general, abysmal.

In ZFS, there is a resilver throttle that is designed to ensure that resilvering activity does not impact interactive performance. Do you have data that suggests otherwise?

> ZIL and L2ARC will obviously help (L2ARC more than ZIL),

ZIL makes zero impact on resilver. I'll have to check to see if L2ARC is still used, but due to the nature of the ARC design, read-once workloads like backup or resilver do not tend to negatively impact frequently used data.

> but at the end of the day, if we've got a 12-hour rebuild (fairly conservative
> in the days of 2TB SATA drives), the performance degradation is going to be
> very real for end-users.

I'd like to see some data on this for modern ZFS implementations (post Summer 2010).

> With distributed parity and spares, you should in theory be able to cut this
> down an order of magnitude. I feel as though you're brushing this off as not
> a big deal when it's an EXTREMELY big deal (in my mind). In my opinion you
> can't just approach this from an MTTDL perspective, you also need to take
> into account user experience. Just because I haven't lost data doesn't mean
> the system isn't (essentially) unavailable. If I can't use the system due to
> performance being a fraction of what it is during normal production, it might
> as well be an outage.

So we have a method to analyze the ability of a system to perform during degradation: performability. This can be applied to computer systems, and we've done some analysis specifically on RAID arrays. See also:
http://www.springerlink.com/content/267851748348k382/
http://blogs.oracle.com/relling/tags/performability
Hence my comment about "doing some math" :-)

>> Put some thoughts down on paper and work through the math. If it all works
>> out, let's implement it!
>> -- richard
>
> I realize it's not intentional Richard, but that response is more than a bit
> condescending. If he could just put it down on paper and code something up, I
> strongly doubt he would be posting his thoughts here. He would be posting
> results. The intention of his post, as far as I can tell, is to perhaps
> inspire someone who CAN just write down the math and write up the code to do
> so. Or at least to have them review his thoughts and give him a dev's
> perspective on how viable bringing something like this to ZFS is. I fear
> responses like "the code is there, figure it out" make the *aris community no
> better than the linux one.

When I talk about spares in tutorials, we discuss various tradeoffs and how to analyse the systems. Interestingly, for the GPFS case, the mirrors example clearly shows the benefit of declustered RAID. However, the triple-parity example (similar to raidz3) is not so persuasive. If you have raidz3 + spares, then why not go ahead and do raidz4? In the tutorial we work through a raidz2 + spare vs raidz3 case, and the raidz3 case is better in both performance and dependability without sacrificing space (an unusual condition!). It is not very difficult to add a raidz4 or indeed any number of additional parity levels, but there is a point of diminishing returns, usually when some other system component becomes more critical than the RAID protection. So, raidz4 + spare is less dependable than raidz5, and so on.
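For a rough feel of the numbers, here is the textbook memoryless MTTDL approximation in a few lines of Python - a sketch only, not the model from the tutorial, with invented MTBF/MTTR figures and a hot spare modeled as nothing more than a shorter repair window:

# Classic MTTDL approximation for an n-disk, p-parity group (memoryless,
# independent failures): MTTDL ~= MTBF^(p+1) / (MTTR^p * n*(n-1)*...*(n-p)).
# All inputs below are invented for illustration.
from math import prod

def mttdl_years(n_disks, parity, mtbf_h=1.0e6, mttr_h=24.0):
    combos = prod(n_disks - i for i in range(parity + 1))
    mttdl_h = mtbf_h ** (parity + 1) / (mttr_h ** parity * combos)
    return mttdl_h / (24 * 365)

# Ten disks spent two ways: a 9-disk raidz2 plus one hot spare (the spare
# only shortens the repair window), versus a 10-disk raidz3 with no spare.
# Usable data capacity (7 disks' worth) is the same in both cases.
print(f"raidz2 + spare: {mttdl_years(9, 2, mttr_h=12):.2e} years")
print(f"raidz3        : {mttdl_years(10, 3, mttr_h=24):.2e} years")

With these made-up inputs the raidz3 group comes out roughly three orders of magnitude ahead, which is the shape of the argument above.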
 -- richard

> ...

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Jim Klimov
2012-Jan-09 02:47 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
2012-01-09 6:25, Richard Elling wrote:

> Note: more analysis of the GPFS implementations is needed, but that will take more
> time than I'll spend this evening :-) Quick hits below...

Good to hear you might look into it after all ;)

>> but at the end of the day, if we've got a 12-hour rebuild (fairly conservative
>> in the days of 2TB SATA drives), the performance degradation is going to be
>> very real for end-users.
>
> I'd like to see some data on this for modern ZFS implementations (post Summer 2010).

Is "scrubbing performance" irrelevant in this discussion? I think that, in general, scrubbing is the read half of a larger rebuild process, at least for a single-vdev pool, so rebuilds are about as long or worse. Am I wrong?

In my home-NAS case, a raidz2 pool of six 2Tb drives, which is 76% full, consistently takes 85 hours to scrub. No SSDs involved, no L2ARC, no ZILs. According to iostat, the HDDs are often utilized at 100% with a random IO load, yielding from 500KBps to 2-3MBps at about 80-100 IOPS per disk (I have a scrub going on at this moment).

This system variably runs oi_148a (LiveUSB recovery) and oi_151a when alive ;)

HTH,
//Jim Klimov
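
P.S. A quick sanity check of those numbers in Python; the assumptions (that a scrub reads all allocated space, parity included, that the 76% is of raw capacity, and decimal units of 1 TB = 1e6 MB) are mine:

# Back-of-the-envelope check of the scrub numbers above.
disks, disk_tb, full, hours = 6, 2.0, 0.76, 85

allocated_tb = disks * disk_tb * full
per_disk_mb_s = allocated_tb * 1e6 / disks / (hours * 3600)
print(f"~{allocated_tb:.1f} TB scanned in {hours} h "
      f"-> ~{per_disk_mb_s:.1f} MB/s sustained per disk")

That works out to a few MB/s per spindle sustained over the whole scrub - random-IO territory, nowhere near what these drives can do sequentially - which is why it takes days.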
Karl Wagner
2012-Jan-10 17:26 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, January 8, 2012 00:28, Bob Friesenhahn wrote:

> I think that I would also be interested in a system which uses the
> so-called spare disks for more protective redundancy, but then reduces
> that protective redundancy in order to use that disk to replace a
> failed disk or to automatically enlarge the pool.
>
> For example, a pool could start out with four-way mirroring when there
> is little data in the pool. When the pool becomes more full, mirror
> devices are automatically removed (from existing vdevs) and used to
> add more vdevs. Eventually a limit would be hit so that no more
> mirrors are allowed to be removed.
>
> Obviously this approach works with simple mirrors but not for raidz.
>
> Bob

I actually disagree about raidz. I have often thought that a "dynamic raidz" would be a great feature.

For instance, you have a 4-way raidz. What you are saying is that you want the array to survive the loss of a single drive. So, from an empty vdev, it starts by writing 2 copies of each block, effectively creating a pair of mirrors. These are quicker to write and quicker to resilver than parity, and you would likely get a read speed increase too.

As the vdev starts to get full, it starts using parity-based redundancy, and converting "older" data to this as well. Performance drops a bit, but it happens slowly. In addition, any older blocks not yet converted are still quicker to read and resilver.

This is only a theory, but it is certainly something which could be considered. It would probably take a lot of rewriting of the raidz code, though.
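
A toy sketch of that write policy in Python - thresholds and geometry are invented, and this is nothing like the actual raidz allocator:

# Toy "dynamic raidz" write policy: mirror new blocks while the vdev is
# fairly empty (fast writes and resilver), switch to a parity layout as
# it fills up.
DATA_DISKS, PARITY = 3, 1                         # a 4-disk, single-parity vdev
MIRROR_COST = 2.0                                 # 2 copies of every block
RAIDZ_COST = (DATA_DISKS + PARITY) / DATA_DISKS   # 4/3 here

def layout_for_new_block(used_fraction, switch_at=0.5):
    return "mirror" if used_fraction < switch_at else "raidz"

for used in (0.10, 0.40, 0.60, 0.90):
    layout = layout_for_new_block(used)
    cost = MIRROR_COST if layout == "mirror" else RAIDZ_COST
    print(f"vdev {used:.0%} full -> write as {layout} "
          f"({cost:.2f}x space per logical byte)")

A real implementation would also need the background pass that rewrites old mirrored blocks as parity stripes, which is where the bp-rewrite problem mentioned elsewhere in this thread comes back in.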
Daniel Carosone
2012-Jan-12 04:05 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:

> ZIL makes zero impact on resilver. I'll have to check to see if L2ARC is still used, but
> due to the nature of the ARC design, read-once workloads like backup or resilver do
> not tend to negatively impact frequently used data.

This is true, in a strict sense (they don't help resilver itself), but it misses the point. They (can) help the system, when resilver is underway.

ZIL helps reduce the impact busy resilvering disks have on other system operation (sync write syscalls and vfs ops by apps). L2ARC, likewise for reads. Both can hide the latency increases that resilvering iops cause for the disks (and which the throttle you mentioned also attempts to minimise).

--
Dan.
Daniel Carosone
2012-Jan-12 05:32 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Thu, Jan 12, 2012 at 03:05:32PM +1100, Daniel Carosone wrote:

> On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:
>> ZIL makes zero impact on resilver. I'll have to check to see if L2ARC is still used, but
>> due to the nature of the ARC design, read-once workloads like backup or resilver do
>> not tend to negatively impact frequently used data.
>
> This is true, in a strict sense (they don't help resilver itself), but
> it misses the point. They (can) help the system, when resilver is
> underway.
>
> ZIL helps reduce the impact busy resilvering disks have on other system

Well, since I'm being strict and picky, I should of course say ZIL-on-slog.

> operation (sync write syscalls and vfs ops by apps). L2ARC, likewise
> for reads. Both can hide the latency increases that resilvering iops
> cause for the disks (and which the throttle you mentioned also
> attempts to minimise).

--
Dan.