Jim Klimov
2012-Jan-06 23:33 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Hello all, I have a new idea up for discussion.

Several RAID systems have implemented "spread" spare drives in the sense that there is not an idling disk waiting to receive a burst of resilver data filling it up; instead, the capacity of the spare disk is spread among all drives in the array. As a result, the healthy array gets one more spindle and works a little faster, and rebuild times are often decreased since more spindles can participate in repairs at the same time.

I don't think I've seen such an idea proposed for ZFS, and I do wonder if it is at all possible with variable-width stripes? Although if the disk is sliced into 200 metaslabs or so, implementing a spread spare is a no-brainer as well.

To be honest, I've seen this a long time ago in (Falcon?) RAID controllers, and recently in a USENIX presentation of IBM GPFS on YouTube. In the latter the speaker goes into greater depth describing their "declustered RAID" approach (as they call it): all blocks - spare, redundancy and data - are intermixed evenly on all drives and not in a single "group" or a mid-level VDEV as they would be for ZFS.

http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related

GPFS with declustered RAID not only decreases rebuild times and/or the impact of rebuilds on end-user operations, but it also happens to increase reliability - in case of a multiple-disk failure in a large RAID-6 or RAID-7 array (in the example they use 47-disk sets) there is a smaller time window during which the data is left in a "critical state" due to lack of redundancy, and there is less data overall in such a state - so the system goes from critical to simply degraded (with some redundancy) in a few minutes.

Another thing they have in GPFS is temporary offlining of disks so that they can catch up when reattached - only newer writes (bigger TXG numbers in ZFS terms) are added to reinserted disks. I am not sure this exists in ZFS today, either. This might simplify physical systems maintenance (as it does for IBM boxes - see the presentation if interested) and quick recovery from temporarily unavailable disks, such as when a disk gets a bus reset and is unavailable for writes for a few seconds (or more) while the array keeps on writing.

I find these ideas cool. I do believe that IBM might get angry if ZFS development copy-pasted them "as is", but it might nonetheless get us inventing a similar wheel that would be a bit different ;) There are already several vendors doing this in some way, so perhaps there is no (patent) monopoly in place already...

And I think all the magic of spread spares and/or "declustered RAID" would go into just making another write-block allocator in the same league as "raidz" or "mirror" are nowadays... BTW, are such allocators pluggable (as software modules)?

What do you think - can and should such ideas find their way into ZFS? Or why not? Perhaps from theoretical or real-life experience with such storage approaches?

//Jim Klimov
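
P.S. To make the layout idea a bit more concrete, here is a toy sketch in Python - not the GPFS algorithm and nothing like the real ZFS allocator; disk count and stripe geometry are invented - showing how data, parity and spare slots can be rotated across all disks so that the spare capacity ends up spread evenly:

# Toy declustered layout: each stripe places its data, parity and spare
# slots starting at a different disk, so "spare" space is spread across
# the whole set instead of sitting on one idle drive.
from collections import Counter

NDISKS = 7                        # 7 physical disks
DATA, PARITY, SPARE = 4, 2, 1     # slots per stripe: a 4+2 stripe plus 1 spare slot

def stripe_layout(stripe_no):
    """Map disk -> role for one stripe, rotating the starting disk."""
    roles = ["data"] * DATA + ["parity"] * PARITY + ["spare"] * SPARE
    start = stripe_no % NDISKS
    return {(start + i) % NDISKS: role for i, role in enumerate(roles)}

usage = [Counter() for _ in range(NDISKS)]
for s in range(7000):             # many stripes
    for disk, role in stripe_layout(s).items():
        usage[disk][role] += 1

for d, c in enumerate(usage):
    print(f"disk {d}: data={c['data']} parity={c['parity']} spare={c['spare']}")

Every disk ends up carrying the same share of data, parity and spare slots, so a rebuild reads from and writes to all surviving disks instead of funneling everything into one idle spindle.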
Bob Friesenhahn
2012-Jan-08 00:28 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sat, 7 Jan 2012, Jim Klimov wrote:

> Several RAID systems have implemented "spread" spare drives
> in the sense that there is not an idling disk waiting to
> receive a burst of resilver data filling it up, but the
> capacity of the spare disk is spread among all drives in
> the array. As a result, the healthy array gets one more
> spindle and works a little faster, and rebuild times are
> often decreased since more spindles can participate in
> repairs at the same time.

I think that I would also be interested in a system which uses the so-called spare disks for more protective redundancy, but then reduces that protective redundancy in order to use that disk to replace a failed disk or to automatically enlarge the pool.

For example, a pool could start out with four-way mirroring when there is little data in the pool. When the pool becomes more full, mirror devices are automatically removed (from existing vdevs) and used to add more vdevs. Eventually a limit would be hit so that no more mirrors are allowed to be removed.

Obviously this approach works with simple mirrors but not for raidz.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
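
To put rough numbers on the tradeoff Bob describes (a toy Python sketch; the 12 x 2 TB shelf is invented and no real pools are involved):

# Toy capacity/redundancy tradeoff for the auto-shrinking-mirror idea above.
# 12 disks of 2 TB, regrouped from 4-way down to 2-way mirrors.
DISKS, DISK_TB = 12, 2.0

for way in (4, 3, 2):
    vdevs = DISKS // way
    usable_tb = vdevs * DISK_TB
    print(f"{way}-way mirrors: {vdevs} vdevs, {usable_tb:.0f} TB usable, "
          f"each vdev survives {way - 1} disk failures")

Each step down in mirror width trades one disk failure of protection per vdev for another vdev's worth of usable space.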
Richard Elling
2012-Jan-08 01:37 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Hi Jim,

On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:

> Hello all,
>
> I have a new idea up for discussion.
>
> Several RAID systems have implemented "spread" spare drives
> in the sense that there is not an idling disk waiting to
> receive a burst of resilver data filling it up, but the
> capacity of the spare disk is spread among all drives in
> the array. As a result, the healthy array gets one more
> spindle and works a little faster, and rebuild times are
> often decreased since more spindles can participate in
> repairs at the same time.

Xiotech has a distributed, relocatable model, but the FRU is the whole ISE. There have been other implementations of more distributed RAIDness in the past (RAID-1E, etc).

The big question is whether they are worth the effort. Spares solve a serviceability problem and only impact availability in an indirect manner. For single-parity solutions, spares can make a big difference in MTTDL, but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).

> I don't think I've seen such an idea proposed for ZFS, and
> I do wonder if it is at all possible with variable-width
> stripes? Although if the disk is sliced into 200 metaslabs
> or so, implementing a spread spare is a no-brainer as well.

Put some thoughts down on paper and work through the math. If it all works out, let's implement it!
 -- richard

> ...

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Jim Klimov
2012-Jan-08 02:59 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
2012-01-08 5:37, Richard Elling wrote:

> The big question is whether they are worth the effort. Spares solve a serviceability
> problem and only impact availability in an indirect manner. For single-parity
> solutions, spares can make a big difference in MTTDL, but have almost no impact
> on MTTDL for double-parity solutions (eg. raidz2).

Well, regarding this part: in the presentation linked in my OP, the IBM presenter suggests that for a 6-disk raid10 (3 mirrors) with one spare drive - overall a 7-disk set - there are these options for "critical" hits to data redundancy when one of the drives dies:

1) Traditional RAID - one full disk is a mirror of another full disk; 100% of a disk's size is "critical" and has to be replicated onto a spare drive ASAP;

2) Declustered RAID - all 7 disks are used for 2 unique data blocks from the "original" setup and one spare block (I am not sure I described it well in words, his diagram shows it better); if a single disk dies, only 1/7 worth of a disk's size is critical (not redundant) and can be fixed faster.

For their typical 47-disk sets of RAID-7-like redundancy, under 1% of the data becomes critical when 3 disks die at once, which is (deemed) unlikely as is.

Apparently, in the GPFS layout, MTTDL is much higher than in raid10+spare with all other stats being similar.

I am not sure I'm ready (or qualified) to sit down and present the math right now - I just heard some ideas that I considered worth sharing and discussing ;)

Thanks for the input,
//Jim
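
P.S. For what it's worth, a back-of-the-envelope version of that comparison in Python. The 1/7 figure is the presenter's; the 100 MB/s rebuild-write rate, the 2 TB disk size and the assumption that all surviving disks share the declustered rebuild are mine, purely for illustration:

# Rough rebuild-exposure comparison for a 7-disk set (6-disk raid10 + spare
# vs. the same capacity declustered over all 7 disks).
disk_tb = 2.0
n_disks = 7
rebuild_write_mb_s = 100.0        # what one disk can absorb while rebuilding

# Traditional: a whole disk's worth of data is critical and must all be
# rewritten onto the single dedicated spare.
trad_critical_tb = disk_tb
trad_hours = trad_critical_tb * 1e6 / rebuild_write_mb_s / 3600

# Declustered: only ~1/7 of a disk's worth is critical, and the repair
# writes are shared by the 6 surviving disks.
decl_critical_tb = disk_tb / n_disks
decl_hours = decl_critical_tb * 1e6 / (rebuild_write_mb_s * (n_disks - 1)) / 3600

print(f"critical data : {trad_critical_tb:.2f} TB vs {decl_critical_tb:.2f} TB")
print(f"time exposed  : {trad_hours:.1f} h  vs {decl_hours:.2f} h")

So with these made-up rates the window without redundancy shrinks from hours to minutes, which matches the "critical to simply degraded in a few minutes" claim from the talk.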
Tim Cook
2012-Jan-08 03:15 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.elling at gmail.com> wrote:

> Hi Jim,
>
> On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
>
>> Several RAID systems have implemented "spread" spare drives
>> in the sense that there is not an idling disk waiting to
>> receive a burst of resilver data filling it up, but the
>> capacity of the spare disk is spread among all drives in
>> the array. As a result, the healthy array gets one more
>> spindle and works a little faster, and rebuild times are
>> often decreased since more spindles can participate in
>> repairs at the same time.
>
> Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
> There have been other implementations of more distributed RAIDness in the
> past (RAID-1E, etc).
>
> The big question is whether they are worth the effort. Spares solve a
> serviceability problem and only impact availability in an indirect manner.
> For single-parity solutions, spares can make a big difference in MTTDL,
> but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).

I disagree. Dedicated spares impact far more than availability. During a rebuild, performance is, in general, abysmal. ZIL and L2ARC will obviously help (L2ARC more than ZIL), but at the end of the day, if we've got a 12-hour rebuild (fairly conservative in the days of 2TB SATA drives), the performance degradation is going to be very real for end-users. With distributed parity and spares, you should in theory be able to cut this down an order of magnitude. I feel as though you're brushing this off as not a big deal when it's an EXTREMELY big deal (in my mind).

In my opinion you can't just approach this from an MTTDL perspective, you also need to take into account user experience. Just because I haven't lost data doesn't mean the system isn't (essentially) unavailable (sorry for the double negative and repeated parentheses). If I can't use the system due to performance being a fraction of what it is during normal production, it might as well be an outage.

>> I don't think I've seen such an idea proposed for ZFS, and
>> I do wonder if it is at all possible with variable-width
>> stripes? Although if the disk is sliced into 200 metaslabs
>> or so, implementing a spread spare is a no-brainer as well.
>
> Put some thoughts down on paper and work through the math. If it all works
> out, let's implement it!
> -- richard

I realize it's not intentional Richard, but that response is more than a bit condescending. If he could just put it down on paper and code something up, I strongly doubt he would be posting his thoughts here. He would be posting results. The intention of his post, as far as I can tell, is to perhaps inspire someone who CAN just write down the math and write up the code to do so. Or at least to have them review his thoughts and give him a dev's perspective on how viable bringing something like this to ZFS is. I fear responses like "the code is there, figure it out" make the *aris community no better than the linux one.

> ...
As always, feel free to tell me why my rant is completely off base ;)

--Tim
Jim Klimov
2012-Jan-08 14:29 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
First of all, I would like to thank Bob, Richard and Tim for at least taking the time to look at this proposal and responding ;)

It is also encouraging to see that 2 of the 3 responders consider this idea at least worth pondering and discussing, as it appeals to their direct interest. Even Richard was not dismissive of it ;)

Finally, as Tim was right to note, I am not a kernel developer (and won't become one as good as those present on this list). Of course, I could "pull the blanket onto my side" and say that I'd try to write that code myself... but it would probably be a long wait, like that for "BP rewrite" - because I already have quite a few commitments and responsibilities as an admin and recently as a parent (yay!)

So, I guess, my piece of the pie is currently limited to RFEs and bug reports... and working in IT for a software development company, I believe (or hope) that's not a useless part of the process ;)

I do believe that ZFS technology is amazing - despite some shortcomings that are still present - and I do want to see it flourish... ASAP! :^)

//Jim

2012-01-08 7:15, Tim Cook wrote:
> On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.elling at gmail.com
> <mailto:richard.elling at gmail.com>> wrote:
> ...
>
> I disagree. Dedicated spares impact far more than availability. During
> a rebuild, performance is, in general, abysmal. ...
> If I can't use the system due to performance being a fraction of what
> it is during normal production, it might as well be an outage.
> ...
> I realize it's not intentional Richard, but that response is more than a
> bit condescending. If he could just put it down on paper and code
> something up, I strongly doubt he would be posting his thoughts here.
> He would be posting results. ...
>
> As always, feel free to tell me why my rant is completely off base ;)
>
> --Tim
Pasi Kärkkäinen
2012-Jan-08 20:12 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, Jan 08, 2012 at 06:59:57AM +0400, Jim Klimov wrote:

> 2012-01-08 5:37, Richard Elling wrote:
>> The big question is whether they are worth the effort. ...
>
> Well, regarding this part: in the presentation linked in my OP,
> the IBM presenter suggests that for a 6-disk raid10 (3 mirrors)
> with one spare drive - overall a 7-disk set - there are these
> options for "critical" hits to data redundancy when one of the
> drives dies:
> ...
> Apparently, in the GPFS layout, MTTDL is much higher than
> in raid10+spare with all other stats being similar.
>
> I am not sure I'm ready (or qualified) to sit down and present
> the math right now - I just heard some ideas that I considered
> worth sharing and discussing ;)

Thanks for the video link (http://www.youtube.com/watch?v=2g5rx4gP6yU). It's very interesting!

GPFS Native RAID seems to be more advanced than current ZFS, and it even has rebalancing implemented (the infamous missing zfs bp-rewrite). It'd definitely be interesting to have something like this implemented in ZFS.

-- Pasi
Richard Elling
2012-Jan-09 02:25 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Note: more analysis of the GPFS implementations is needed, but that will take more time than I'll spend this evening :-) Quick hits below...

On Jan 7, 2012, at 7:15 PM, Tim Cook wrote:

> On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.elling at gmail.com> wrote:
>> The big question is whether they are worth the effort. Spares solve a
>> serviceability problem and only impact availability in an indirect manner.
>> For single-parity solutions, spares can make a big difference in MTTDL,
>> but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).
>
> I disagree. Dedicated spares impact far more than availability. During a
> rebuild, performance is, in general, abysmal.

In ZFS, there is a resilver throttle that is designed to ensure that resilvering activity does not impact interactive performance. Do you have data that suggests otherwise?

> ZIL and L2ARC will obviously help (L2ARC more than ZIL),

ZIL makes zero impact on resilver. I'll have to check to see if L2ARC is still used, but due to the nature of the ARC design, read-once workloads like backup or resilver do not tend to negatively impact frequently used data.

> but at the end of the day, if we've got a 12-hour rebuild (fairly conservative
> in the days of 2TB SATA drives), the performance degradation is going to be
> very real for end-users.

I'd like to see some data on this for modern ZFS implementations (post Summer 2010).

> With distributed parity and spares, you should in theory be able to cut this
> down an order of magnitude. I feel as though you're brushing this off as not
> a big deal when it's an EXTREMELY big deal (in my mind). In my opinion you
> can't just approach this from an MTTDL perspective, you also need to take
> into account user experience. Just because I haven't lost data doesn't mean
> the system isn't (essentially) unavailable. If I can't use the system due to
> performance being a fraction of what it is during normal production, it might
> as well be an outage.

So we have a method to analyze the ability of a system to perform during degradation: performability. This can be applied to computer systems, and we've done some analysis specifically on RAID arrays. See also:
http://www.springerlink.com/content/267851748348k382/
http://blogs.oracle.com/relling/tags/performability
Hence my comment about "doing some math" :-)

>> Put some thoughts down on paper and work through the math. If it all works
>> out, let's implement it!
>> -- richard
>
> I realize it's not intentional Richard, but that response is more than a bit
> condescending. If he could just put it down on paper and code something up, I
> strongly doubt he would be posting his thoughts here. He would be posting
> results. The intention of his post, as far as I can tell, is to perhaps
> inspire someone who CAN just write down the math and write up the code to do
> so. Or at least to have them review his thoughts and give him a dev's
> perspective on how viable bringing something like this to ZFS is. I fear
> responses like "the code is there, figure it out" make the *aris community no
> better than the linux one.

When I talk about spares in tutorials, we discuss various tradeoffs and how to analyse the systems. Interestingly, for the GPFS case, the mirrors example clearly shows the benefit of declustered RAID. However, the triple-parity example (similar to raidz3) is not so persuasive. If you have raidz3 + spares, then why not go ahead and do raidz4? In the tutorial we work through a raidz2 + spare vs raidz3 case, and the raidz3 case is better in both performance and dependability without sacrificing space (an unusual condition!). It is not very difficult to add a raidz4 or indeed any number of additional parity levels, but there is a point of diminishing returns, usually when some other system component becomes more critical than the RAID protection. So, raidz4 + spare is less dependable than raidz5, and so on.
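For a rough feel of the numbers, here is the textbook memoryless MTTDL approximation in a few lines of Python - a sketch only, not the model from the tutorial, with invented MTBF/MTTR figures and a hot spare modeled as nothing more than a shorter repair window:

# Classic MTTDL approximation for an n-disk, p-parity group (memoryless,
# independent failures): MTTDL ~= MTBF^(p+1) / (MTTR^p * n*(n-1)*...*(n-p)).
# All inputs below are invented for illustration.
from math import prod

def mttdl_years(n_disks, parity, mtbf_h=1.0e6, mttr_h=24.0):
    combos = prod(n_disks - i for i in range(parity + 1))
    mttdl_h = mtbf_h ** (parity + 1) / (mttr_h ** parity * combos)
    return mttdl_h / (24 * 365)

# Ten disks spent two ways: a 9-disk raidz2 plus one hot spare (the spare
# only shortens the repair window), versus a 10-disk raidz3 with no spare.
# Usable data capacity (7 disks' worth) is the same in both cases.
print(f"raidz2 + spare: {mttdl_years(9, 2, mttr_h=12):.2e} years")
print(f"raidz3        : {mttdl_years(10, 3, mttr_h=24):.2e} years")

With these made-up inputs the raidz3 group comes out roughly three orders of magnitude ahead, which is the shape of the argument above.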
 -- richard

> ...

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Jim Klimov
2012-Jan-09 02:47 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
2012-01-09 6:25, Richard Elling wrote:

> Note: more analysis of the GPFS implementations is needed, but that will take more
> time than I'll spend this evening :-) Quick hits below...

Good to hear you might look into it after all ;)

>> but at the end of the day, if we've got a 12-hour rebuild (fairly conservative
>> in the days of 2TB SATA drives), the performance degradation is going to be
>> very real for end-users.
>
> I'd like to see some data on this for modern ZFS implementations (post Summer 2010).

Is "scrubbing performance" irrelevant in this discussion? I think that, in general, scrubbing is the read half of a larger rebuild process, at least for a single-vdev pool, so rebuilds are about as long or worse. Am I wrong?

In my home-NAS case, a raidz2 pool of six 2Tb drives, which is 76% full, consistently takes 85 hours to scrub. No SSDs involved, no L2ARC, no ZILs. According to iostat, the HDDs are often utilized at 100% with a random IO load, yielding from 500KBps to 2-3MBps at about 80-100 IOPS per disk (I have a scrub going on at this moment).

This system variably runs oi_148a (LiveUSB recovery) and oi_151a when alive ;)

HTH,
//Jim Klimov
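
P.S. A quick sanity check of those numbers in Python; the assumptions (that a scrub reads all allocated space, parity included, that the 76% is of raw capacity, and decimal units of 1 TB = 1e6 MB) are mine:

# Back-of-the-envelope check of the scrub numbers above.
disks, disk_tb, full, hours = 6, 2.0, 0.76, 85

allocated_tb = disks * disk_tb * full
per_disk_mb_s = allocated_tb * 1e6 / disks / (hours * 3600)
print(f"~{allocated_tb:.1f} TB scanned in {hours} h "
      f"-> ~{per_disk_mb_s:.1f} MB/s sustained per disk")

That works out to a few MB/s per spindle sustained over the whole scrub - random-IO territory, nowhere near what these drives can do sequentially - which is why it takes days.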
Karl Wagner
2012-Jan-10 17:26 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, January 8, 2012 00:28, Bob Friesenhahn wrote:

> I think that I would also be interested in a system which uses the
> so-called spare disks for more protective redundancy, but then reduces
> that protective redundancy in order to use that disk to replace a
> failed disk or to automatically enlarge the pool.
>
> For example, a pool could start out with four-way mirroring when there
> is little data in the pool. When the pool becomes more full, mirror
> devices are automatically removed (from existing vdevs) and used to
> add more vdevs. Eventually a limit would be hit so that no more
> mirrors are allowed to be removed.
>
> Obviously this approach works with simple mirrors but not for raidz.
>
> Bob

I actually disagree about raidz. I have often thought that a "dynamic raidz" would be a great feature.

For instance, you have a 4-way raidz. What you are saying is that you want the array to survive the loss of a single drive. So, from an empty vdev, it starts by writing 2 copies of each block, effectively creating a pair of mirrors. These are quicker to write and quicker to resilver than parity, and you would likely get a read speed increase too.

As the vdev starts to get full, it starts using parity-based redundancy, and converting "older" data to this as well. Performance drops a bit, but it happens slowly. In addition, any older blocks not yet converted are still quicker to read and resilver.

This is only a theory, but it is certainly something which could be considered. It would probably take a lot of rewriting of the raidz code, though.
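
A toy sketch of that write policy in Python - thresholds and geometry are invented, and this is nothing like the actual raidz allocator:

# Toy "dynamic raidz" write policy: mirror new blocks while the vdev is
# fairly empty (fast writes and resilver), switch to a parity layout as
# it fills up.
DATA_DISKS, PARITY = 3, 1                         # a 4-disk, single-parity vdev
MIRROR_COST = 2.0                                 # 2 copies of every block
RAIDZ_COST = (DATA_DISKS + PARITY) / DATA_DISKS   # 4/3 here

def layout_for_new_block(used_fraction, switch_at=0.5):
    return "mirror" if used_fraction < switch_at else "raidz"

for used in (0.10, 0.40, 0.60, 0.90):
    layout = layout_for_new_block(used)
    cost = MIRROR_COST if layout == "mirror" else RAIDZ_COST
    print(f"vdev {used:.0%} full -> write as {layout} "
          f"({cost:.2f}x space per logical byte)")

A real implementation would also need the background pass that rewrites old mirrored blocks as parity stripes, which is where the bp-rewrite problem mentioned elsewhere in this thread comes back in.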
Daniel Carosone
2012-Jan-12 04:05 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:

> ZIL makes zero impact on resilver. I'll have to check to see if L2ARC is still used, but
> due to the nature of the ARC design, read-once workloads like backup or resilver do
> not tend to negatively impact frequently used data.

This is true, in a strict sense (they don't help resilver itself), but it misses the point. They (can) help the system, when resilver is underway.

ZIL helps reduce the impact busy resilvering disks have on other system operation (sync write syscalls and vfs ops by apps). L2ARC, likewise for reads. Both can hide the latency increases that resilvering iops cause for the disks (and which the throttle you mentioned also attempts to minimise).

--
Dan.
Daniel Carosone
2012-Jan-12 05:32 UTC
[zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Thu, Jan 12, 2012 at 03:05:32PM +1100, Daniel Carosone wrote:

> On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:
>> ZIL makes zero impact on resilver. I'll have to check to see if L2ARC is still used, but
>> due to the nature of the ARC design, read-once workloads like backup or resilver do
>> not tend to negatively impact frequently used data.
>
> This is true, in a strict sense (they don't help resilver itself), but
> it misses the point. They (can) help the system, when resilver is
> underway.
>
> ZIL helps reduce the impact busy resilvering disks have on other system

Well, since I'm being strict and picky, I should of course say ZIL-on-slog.

> operation (sync write syscalls and vfs ops by apps). L2ARC, likewise
> for reads. Both can hide the latency increases that resilvering iops
> cause for the disks (and which the throttle you mentioned also
> attempts to minimise).

--
Dan.