Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.

Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool. He really wants to be able to add a drive or two to an existing pool. Yes, there are ways to expand storage to some extent without rebuilding it, like replacing disks with larger ones, but I would say that is not enough for a typical home user.

This might be important for corporate use too, although frankly speaking I doubt many administrators use it in a DC environment. Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.

So, my question is: what prevents the same from being introduced in zfs at present? Is it the design of zfs, or is there simply no demand for it in the community? My understanding is that at present there are no plans to introduce it.

--Regards,
Roman Naumenko
roman at naumenko.com
On 6/2/10 3:54 PM -0700 Roman Naumenko wrote:
> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.

That's incorrect. zfs pools can be expanded at any time. AFAIK zfs has always had this capability.

> Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.

I wasn't aware that Netapp could rebalance. Is that a true Netapp feature, or is it a matter of copying the data "manually"? zfs doesn't have a cleaner process that rebalances, so for zfs you would have to copy the data to rebalance the pool. I certainly wouldn't make my Netapp/zfs decision based on that (alone).

-frank
On Wed, Jun 2, 2010 at 3:54 PM, Roman Naumenko <roman at naumenko.ca> wrote:
> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.
>
> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.

There are two ways to increase the storage space available to a ZFS pool:

1. add more vdevs to the pool
2. replace each drive in a vdev with a larger drive

The first option "expands the width" of the pool, adds redundancy to the pool, and (should) increase the performance of the pool. This is very simple to do, but requires having the drive bays and/or drive connectors available. (In fact, any time you add a vdev to a pool, including when you first create it, you go through this process.)

The second option "increases the total storage" of the pool, without changing any of the redundancy of the pool. Performance may or may not increase. Once all the drives in a vdev are replaced, the storage space becomes available to the pool (depending on the ZFS version, you may need to export/import the pool for the space to become available).

We've used both of the above quite successfully, both at home and at work. Not sure what your buddy was talking about. :)

--
Freddie Cash
fjwcash at gmail.com
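[To make the two options concrete, a minimal sketch of each, assuming a pool named tank and purely hypothetical device names - adjust the controller/target numbers to your own system:]

    # Option 1: widen the pool by adding another mirrored vdev
    zpool add tank mirror c3t0d0 c3t1d0

    # Option 2: grow an existing vdev by swapping in larger drives, one at a time
    zpool replace tank c2t0d0 c4t0d0   # wait for the resilver to finish
    zpool replace tank c2t1d0 c4t1d0   # then repeat for each remaining drive in the vdev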
Roman Naumenko wrote:
> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.
>
> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool. He really wants to be able to add a drive or two to an existing pool. Yes, there are ways to expand storage to some extent without rebuilding it, like replacing disks with larger ones, but I would say that is not enough for a typical home user.
>
> This might be important for corporate use too, although frankly speaking I doubt many administrators use it in a DC environment. Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.
>
> So, my question is: what prevents the same from being introduced in zfs at present? Is it the design of zfs, or is there simply no demand for it in the community? My understanding is that at present there are no plans to introduce it.
>
> --Regards,
> Roman Naumenko
> roman at naumenko.com

Expanding a RAIDZ (which, really, is the only thing ZFS can't do right now, w/r/t adding disks) requires the Block Pointer (BP) Rewrite functionality before it can get implemented.

We've been promised BP rewrite for a while, but I have no visibility as to where development on it is in the schedule.

Fortunately, several other things also depend on BP rewrite (e.g. shrinking a pool (removing vdevs), efficient defragmentation/compaction, etc.). So, while resizing a raidZ vdev isn't really high on the list of things to do, the fundamental building block which would allow it to happen is very much important for Oracle. And, once BP rewrite is available, I suspect that there might be a raidZ resize contribution from one of the non-Oracle folks. Or maybe even someone like me (who's not a ZFS developer inside Oracle, but I play one on TV...)

Dev guys - where are we on BP rewrite?

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Jun 2, 2010, at 4:08 PM, Freddie Cash wrote:
> On Wed, Jun 2, 2010 at 3:54 PM, Roman Naumenko <roman at naumenko.ca> wrote:
>> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.
>>
>> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.
>
> There are two ways to increase the storage space available to a ZFS pool:
> 1. add more vdevs to the pool
> 2. replace each drive in a vdev with a larger drive

3. grow a LUN and export/import (old releases) or toggle autoexpand=on (later releases)
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
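[A rough sketch of that third option, assuming a pool named tank sitting on a SAN LUN that has just been grown on the array side; the device name is hypothetical:]

    # newer releases: let the pool pick up the larger LUN automatically
    zpool set autoexpand=on tank
    # or expand one device explicitly
    zpool online -e tank c0t0d0
    # older releases: zpool export tank ; zpool import tank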
On Jun 2, 2010, at 3:54 PM, Roman Naumenko wrote:
> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.
>
> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.

Heck, let him buy a NetApp :-)

> He really wants to be able to add a drive or two to an existing pool. Yes, there are ways to expand storage to some extent without rebuilding it, like replacing disks with larger ones, but I would say that is not enough for a typical home user.

Why not? I do this quite often. Growing is easy, shrinking is more challenging.

> This might be important for corporate use too, although frankly speaking I doubt many administrators use it in a DC environment.
>
> Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.
>
> So, my question is: what prevents the same from being introduced in zfs at present? Is it the design of zfs, or is there simply no demand for it in the community?

It's been there since 2005: zpool subcommand add.
 -- richard

> My understanding is that at present there are no plans to introduce it.
>
> --Regards,
> Roman Naumenko
> roman at naumenko.com

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
Richard Elling said the following, on 06/02/2010 08:50 PM:
> On Jun 2, 2010, at 3:54 PM, Roman Naumenko wrote:
>> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp. Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.
>
> Heck, let him buy a NetApp :-)

No chance, he likes to build everything himself.

>> He really wants to be able to add a drive or two to an existing pool. Yes, there are ways to expand storage to some extent without rebuilding it, like replacing disks with larger ones, but I would say that is not enough for a typical home user.
>
> Why not? I do this quite often. Growing is easy, shrinking is more challenging.
>
>> This might be important for corporate use too, although frankly speaking I doubt many administrators use it in a DC environment.
>>
>> Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.
>>
>> So, my question is: what prevents the same from being introduced in zfs at present? Is it the design of zfs, or is there simply no demand for it in the community?
>
> It's been there since 2005: zpool subcommand add.
> -- richard

Well, I didn't explain it very clearly. I meant that the size of a raidz array can't be changed. Sure, zpool add can do the job for a pool - but not for a raidz configuration.

Roman Naumenko
roman at naumenko.ca
Erik Trimble said the following, on 06/02/2010 07:16 PM:
> Roman Naumenko wrote:
>> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.
>>
>> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool. He really wants to be able to add a drive or two to an existing pool. Yes, there are ways to expand storage to some extent without rebuilding it, like replacing disks with larger ones, but I would say that is not enough for a typical home user.
>>
>> This might be important for corporate use too, although frankly speaking I doubt many administrators use it in a DC environment. Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.
>>
>> So, my question is: what prevents the same from being introduced in zfs at present? Is it the design of zfs, or is there simply no demand for it in the community?
>>
>> My understanding is that at present there are no plans to introduce it.
>>
>> --Regards,
>> Roman Naumenko
>> roman at naumenko.com
>
> Expanding a RAIDZ (which, really, is the only thing ZFS can't do right now, w/r/t adding disks) requires the Block Pointer (BP) Rewrite functionality before it can get implemented.
>
> We've been promised BP rewrite for a while, but I have no visibility as to where development on it is in the schedule.

I thought it was about the hard-defined vdev configuration set when a raidz is created. But anyway, it's just not there...

--Roman
roman at naumenko.ca
On Wed, Jun 2, 2010 at 3:54 PM, Roman Naumenko <roman at naumenko.ca> wrote:
> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.

There's no expansion for aggregates in OnTap, either. You can add more disks (as a raid-dp or mirror set) to an existing aggr, but you can also add more vdevs (as raidz or mirrors) to a zpool too.

> He really wants to be able to add a drive or two to an existing pool. Yes, there are ways to expand storage to some extent without rebuilding it, like replacing disks with larger ones, but I would say that is not enough for a typical home user.

You can do this. 'zpool add'

> Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.

Yeah, you can add to an aggr, but you can't add to a raid-dp set. It's the same as ZFS.

ZFS doesn't require that you zero disks, and there is no rebalancing. As more data is written to the pool it will become more balanced, however.

> So, my question is: what prevents the same from being introduced in zfs at present? Is it the design of zfs, or is there simply no demand for it in the community?
>
> My understanding is that at present there are no plans to introduce it.

Rebalancing depends on bp_rewrite, which is vaporware still. There has been discussion of it for a while but no implementation that I know of.

Once the feature is added, it will be possible to add or remove devices from a zpool or vdev, something that OnTap can't do.

-B

--
Brandon High : bhigh at freaks.com
Brandon High said the following, on 06/02/2010 11:47 PM:
> On Wed, Jun 2, 2010 at 3:54 PM, Roman Naumenko <roman at naumenko.ca> wrote:
>> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.
>
> There's no expansion for aggregates in OnTap, either. You can add more disks (as a raid-dp or mirror set) to an existing aggr, but you can also add more vdevs (as raidz or mirrors) to a zpool too.

I think there is a difference. Just quickly checked the NetApp site:

  Adding new disks to a RAID group

  If a volume has more than one RAID group, you can specify the RAID group to which you are adding disks. To add new disks to a specific RAID group of a volume, complete the following step.

  Example: The following command adds two disks to RAID group 0 of the vol0 volume:

      vol add vol0 -g rg0 2

So you can obviously add disks directly to a raid group as well.

>> He really wants to be able to add a drive or two to an existing pool. Yes, there are ways to expand storage to some extent without rebuilding it, like replacing disks with larger ones, but I would say that is not enough for a typical home user.
>
> You can do this. 'zpool add'
>
>> Nevertheless, NetApp appears to have such a feature, as I learned from my co-worker. It works with some restrictions (you have to zero the disks before adding them, and rebalance the aggregate afterwards, and even then the distribution is not perfect) - but Ontap is able to expand aggregates nevertheless.
>
> Yeah, you can add to an aggr, but you can't add to a raid-dp set. It's the same as ZFS.
>
> ZFS doesn't require that you zero disks, and there is no rebalancing. As more data is written to the pool it will become more balanced, however.
>
>> So, my question is: what prevents the same from being introduced in zfs at present? Is it the design of zfs, or is there simply no demand for it in the community?
>>
>> My understanding is that at present there are no plans to introduce it.
>
> Rebalancing depends on bp_rewrite, which is vaporware still. There has been discussion of it for a while but no implementation that I know of.
>
> Once the feature is added, it will be possible to add or remove devices from a zpool or vdev, something that OnTap can't do.

But are there any plans to implement it?

--Roman
Richard Elling <richard at nexenta.com> writes:
>> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.
>
> Heck, let him buy a NetApp :-)

Definitely a possibility, given the availability and pricing of oldish NetApp hardware on eBay. Although for home use, it is easier to put together something adequately power-saving and silent with OpenSolaris and PC hardware than with NetApp gear.

--
I wasn't so desperate yet that I actually looked into documentation.
                                                  -- Juergen Nickelsen
On Wed, June 2, 2010 17:54, Roman Naumenko wrote:
> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.
>
> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.

I set up my home fileserver with ZFS (in 2006) BECAUSE zfs could expand the pool for me, and nothing else I had access to could do that (home fileserver, little budget).

My server is currently running with one data pool, three vdevs. Each of the data vdevs is a two-way mirror. I started with one, expanded to two, then expanded to three. Rather than expanding to four when this fills up, I'm going to attach a larger drive to the first mirror vdev, and then a second one, and then remove the two current drives, thus expanding the vdev without ever compromising the redundancy.

My choice of mirrors rather than RAIDZ is based on the fact that I have only 8 hot-swap bays (I still think of this as LARGE for a home server; the competition, things like the Drobo, tends to have 4 or 5), that I don't need really large amounts of storage (after my latest upgrade I'm running with 1.2TB of available data space), and that I expected to need to expand storage over the life of the system. With mirror vdevs, I can expand them without compromising redundancy even temporarily, by attaching the new drives before I detach the old drives; I couldn't do that with RAIDZ. Also, the fact that disk is now so cheap means that 100% redundancy is affordable, I don't have to compromise on RAIDZ.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
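[A minimal sketch of the mirror-upgrade dance described above, assuming a pool named tank and hypothetical device names - old disks c2t0d0/c2t1d0, larger new disks c3t0d0/c3t1d0:]

    zpool attach tank c2t0d0 c3t0d0   # temporarily a three-way mirror; wait for resilver
    zpool attach tank c2t1d0 c3t1d0   # wait for resilver again
    zpool detach tank c2t0d0
    zpool detach tank c2t1d0          # the vdev is now backed only by the larger drives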
> Expanding a RAIDZ (which, really, is the only thing ZFS can't do right now, w/r/t adding disks) requires the Block Pointer (BP) Rewrite functionality before it can get implemented.
>
> We've been promised BP rewrite for a while, but I have no visibility as to where development on it is in the schedule.
>
> Fortunately, several other things also depend on BP rewrite (e.g. shrinking a pool (removing vdevs), efficient defragmentation/compaction, etc.).
>
> So, while resizing a raidZ vdev isn't really high on the list of things to do, the fundamental building block which would allow it to happen is very much important for Oracle. And, once BP rewrite is available, I suspect that there might be a raidZ resize contribution from one of the non-Oracle folks. Or maybe even someone like me (who's not a ZFS developer inside Oracle, but I play one on TV...)
>
> Dev guys - where are we on BP rewrite?

I was thinking about asking the same thing recently, as I would really like to see BP rewrite implemented. This seems to pop up here every several months. There were rumblings last fall that the BP rewrite stuff would potentially be finished by now. The bug (CR 4852783) has been around for 7 years, though.

The functionality it would enable would be quite attractive for many more things than just expanding a raidz vdev, although that is my primary interest in it. The basics of expanding a raidz vdev once BP rewrite is done have already been outlined by Adam Leventhal. See the URL below.

http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Richard Bruce
Using a stripe of mirrors (RAID0) you can get the benefits of multiple spindle performance, easy expansion support (just add new mirrors to the end of the raid0 stripe), and 100% data redundancy. If you can afford to pay double for your storage (the cost of mirroring), this is IMO the best solution.

Note that this solution is not quite as resilient against hardware failure as raidz2 or raidz3. While the RAID1+0 solution can tolerate multiple drive failures, if both drives in a mirror fail, you lose data.

If you're clever, you'll also try to make sure each side of the mirror is on a different controller, and if you have enough controllers available, you'll also try to balance the controllers across stripes. One way to help with that is to leave a drive or two available as a hot spare.

Btw, the above recommendation mirrors what Jeff Bonwick himself (the creator of ZFS) has advised on his blog.

   -- Garrett

On Thu, 2010-06-03 at 09:06 -0500, David Dyer-Bennet wrote:
> On Wed, June 2, 2010 17:54, Roman Naumenko wrote:
>> Recently I talked to a co-worker who manages NetApp storage. We discussed size changes for pools in zfs and aggregates in NetApp.
>>
>> Some time ago I had suggested zfs to a buddy of mine for his new home storage server, but he turned it down since there is no expansion available for a pool.
>
> I set up my home fileserver with ZFS (in 2006) BECAUSE zfs could expand the pool for me, and nothing else I had access to could do that (home fileserver, little budget).
>
> My server is currently running with one data pool, three vdevs. Each of the data vdevs is a two-way mirror. I started with one, expanded to two, then expanded to three. Rather than expanding to four when this fills up, I'm going to attach a larger drive to the first mirror vdev, and then a second one, and then remove the two current drives, thus expanding the vdev without ever compromising the redundancy.
>
> My choice of mirrors rather than RAIDZ is based on the fact that I have only 8 hot-swap bays (I still think of this as LARGE for a home server; the competition, things like the Drobo, tends to have 4 or 5), that I don't need really large amounts of storage (after my latest upgrade I'm running with 1.2TB of available data space), and that I expected to need to expand storage over the life of the system. With mirror vdevs, I can expand them without compromising redundancy even temporarily, by attaching the new drives before I detach the old drives; I couldn't do that with RAIDZ. Also, the fact that disk is now so cheap means that 100% redundancy is affordable, I don't have to compromise on RAIDZ.
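[A sketch of the striped-mirror layout Garrett recommends, with each mirror split across two hypothetical controllers (c0 and c1) and one bay left as a hot spare:]

    zpool create tank \
        mirror c0t0d0 c1t0d0 \
        mirror c0t1d0 c1t1d0 \
        mirror c0t2d0 c1t2d0 \
        spare  c0t3d0

    # later expansion is just another mirror pair:
    zpool add tank mirror c0t4d0 c1t4d0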
On Thu, June 3, 2010 10:15, Garrett D'Amore wrote:
> Using a stripe of mirrors (RAID0) you can get the benefits of multiple spindle performance, easy expansion support (just add new mirrors to the end of the raid0 stripe), and 100% data redundancy. If you can afford to pay double for your storage (the cost of mirroring), this is IMO the best solution.

Referencing "RAID0" here in the context of ZFS is confusing, though. Are you suggesting using underlying RAID hardware to create virtual volumes to then present to ZFS, or what?

> Note that this solution is not quite as resilient against hardware failure as raidz2 or raidz3. While the RAID1+0 solution can tolerate multiple drive failures, if both drives in a mirror fail, you lose data.

In a RAIDZ solution, two or more drive failures lose your data. In a mirrored solution, losing the WRONG two drives will still lose your data, but you have some chance of surviving losing a random two drives. So I would describe the mirror solution as more resilient.

So going to RAIDZ2 or even RAIDZ3 would be better, I agree.

In an 8-bay chassis, there are other concerns, too. Do I keep space open for a hot spare? There's no real point in a hot spare if you have only one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2 plus a hot spare. And putting everything into one vdev means that for any upgrade I have to replace all 8 drives at once, a financial problem for a home server.

> If you're clever, you'll also try to make sure each side of the mirror is on a different controller, and if you have enough controllers available, you'll also try to balance the controllers across stripes.

I did manage to split the mirrors across controllers (I have 6 SATA on the motherboard and I added an 8-port SAS card with SAS-SATA cabling).

> One way to help with that is to leave a drive or two available as a hot spare.
>
> Btw, the above recommendation mirrors what Jeff Bonwick himself (the creator of ZFS) has advised on his blog.

I believe that article directly influenced my choice, in fact.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Wed, Jun 2, 2010 at 8:10 PM, Roman Naumenko <roman at naumenko.ca> wrote:
> Well, I didn't explain it very clearly. I meant that the size of a raidz array can't be changed. Sure, zpool add can do the job for a pool - but not for a raidz configuration.

You can't increase the number of drives in a raidz vdev, no. Going from a 4-drive raidz1 to a 5-drive raidz1 is currently impossible. And going from a raidz1 to a raidz2 vdev is currently impossible. On the flip side, it's rare to find a hardware RAID controller that allows this.

But you can increase the storage space available in a raidz vdev, by replacing each drive in the raidz vdev with a larger drive. We just did this, going from 8x 500 GB drives in a raidz2 vdev, to 8x 1.5 TB drives in a raidz2 vdev.

--
Freddie Cash
fjwcash at gmail.com
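[A sketch of that whole-vdev upgrade, with hypothetical device names; each replacement must finish resilvering before the next drive is swapped:]

    # repeat for each of the eight drives in the raidz2 vdev
    zpool replace tank c2t0d0 c3t0d0
    zpool status tank   # wait until the resilver completes before replacing the next drive

    # once the last drive has been replaced, the extra space appears
    # (toggle autoexpand=on, or export/import on older releases)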
On Thu, 2010-06-03 at 10:35 -0500, David Dyer-Bennet wrote:
> On Thu, June 3, 2010 10:15, Garrett D'Amore wrote:
>> Using a stripe of mirrors (RAID0) you can get the benefits of multiple spindle performance, easy expansion support (just add new mirrors to the end of the raid0 stripe), and 100% data redundancy. If you can afford to pay double for your storage (the cost of mirroring), this is IMO the best solution.
>
> Referencing "RAID0" here in the context of ZFS is confusing, though. Are you suggesting using underlying RAID hardware to create virtual volumes to then present to ZFS, or what?

RAID0 is basically the default configuration of a ZFS pool -- it's a concatenation of the underlying vdevs. In this case the vdevs should themselves be two-drive mirrors.

This of course has to be done in the ZFS layer, and ZFS doesn't call it RAID0, any more than it calls a mirror RAID1, but effectively that's what they are.

>> Note that this solution is not quite as resilient against hardware failure as raidz2 or raidz3. While the RAID1+0 solution can tolerate multiple drive failures, if both drives in a mirror fail, you lose data.
>
> In a RAIDZ solution, two or more drive failures lose your data. In a mirrored solution, losing the WRONG two drives will still lose your data, but you have some chance of surviving losing a random two drives. So I would describe the mirror solution as more resilient.
>
> So going to RAIDZ2 or even RAIDZ3 would be better, I agree.

From a data resiliency point, yes, raidz2 or raidz3 offers better protection. At a significant performance cost.

Given enough drives, one could probably imagine using raidz3 underlying vdevs, with RAID0 striping to spread I/O across multiple spindles. I'm not sure how well this would perform, but I suspect it would perform better than straight raidz2/raidz3, but at a significant expense (you'd need a lot of drives).

> In an 8-bay chassis, there are other concerns, too. Do I keep space open for a hot spare? There's no real point in a hot spare if you have only one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2 plus a hot spare. And putting everything into one vdev means that for any upgrade I have to replace all 8 drives at once, a financial problem for a home server.

This is one of the reasons I don't advocate using raidz (any version) for home use, unless you can't afford the cost in space represented by mirroring and a hot spare or two. (The other reason ... for my use at least... is the performance cost. I want to use my array to host compilation workspaces, and for that I would prefer to get the most performance out of my solution. I suppose I could add some SSDs... but I still think multiple spindles are a good option when you can do it.)

In an 8 drive chassis, without any SSDs involved, I'd configure 6 of the drives as a 3 vdev stripe consisting of mirrors of 2 drives, and I'd leave the remaining two bays as hot spares. Btw, using the hot spares in this way potentially means you can use those bays later to upgrade to larger drives in the future, without offlining anything and without taking too much of a performance penalty when you do so.

>> If you're clever, you'll also try to make sure each side of the mirror is on a different controller, and if you have enough controllers available, you'll also try to balance the controllers across stripes.
>
> I did manage to split the mirrors across controllers (I have 6 SATA on the motherboard and I added an 8-port SAS card with SAS-SATA cabling).
>
>> One way to help with that is to leave a drive or two available as a hot spare.
>>
>> Btw, the above recommendation mirrors what Jeff Bonwick himself (the creator of ZFS) has advised on his blog.
>
> I believe that article directly influenced my choice, in fact.

Okay, good. :-)

  - Garrett
David Dyer-Bennet wrote:
> My choice of mirrors rather than RAIDZ is based on the fact that I have only 8 hot-swap bays (I still think of this as LARGE for a home server; the competition, things like the Drobo, tends to have 4 or 5), that I don't need really large amounts of storage (after my latest upgrade I'm running with 1.2TB of available data space), and that I expected to need to expand storage over the life of the system. With mirror vdevs, I can expand them without compromising redundancy even temporarily, by attaching the new drives before I detach the old drives; I couldn't do that with RAIDZ. Also, the fact that disk is now so cheap means that 100% redundancy is affordable, I don't have to compromise on RAIDZ.

Maybe I have been unlucky too many times doing storage admin in the 90s, but simple mirroring still scares me. Even with a hot spare (you do have one, right?) the rebuild window leaves the entire pool exposed to a single failure.

One of the nice things about zfs is that it allows, "to each his own." My home server's main pool is 22x 73GB disks in a Sun A5000 configured as RAIDZ3. Even without a hot spare, it takes several failures to get the pool into trouble.

At the same time, there are several downsides to a wide stripe like that, including relatively poor iops and longer rebuild windows. As noted above, until bp_rewrite arrives, I cannot change the geometry of a vdev, which kind of limits the flexibility.

As a side rant, I still find myself baffled that Oracle/Sun correctly touts the benefits of zfs in the enterprise, including tremendous flexibility and simplicity of filesystem provisioning and nondisruptive changes to filesystems via properties.

These forums are filled with people stating that the enterprise demands simple, flexible and nondisruptive filesystem changes, but no enterprise cares about simple, flexible and nondisruptive pool/vdev changes, e.g. changing a vdev geometry or evacuating a vdev. I can't accept that zfs flexibility is critical and zpool flexibility is unwanted.
> If you're clever, you'll also try to make sure each side of the mirror is on a different controller, and if you have enough controllers available, you'll also try to balance the controllers across stripes.

Something like this?

# zpool status fibre0
  pool: fibre0
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        fibre0       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
            c5t0d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t2d0   ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c5t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
            c5t6d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t19d0  ONLINE       0     0     0
            c5t5d0   ONLINE       0     0     0
        spares
          c2t22d0    AVAIL

errors: No known data errors

However, unlike the bad old days of SVM (DiskSuite or Solstice DiskSuite or Online DiskSuite, etc.) I have no idea what algorithm is used to pick the hot spare in the event of a failure. I mean, if I had more than one hot spare there, of course. Also, I think the weird order of controllers is a user mistake on my part. Some of them have c5 listed first and others have c2 listed first. I don't know if that matters at all, however.

I can add mirrors on the fly but I can not (yet) remove them. I would imagine that the algorithm to remove data from vdevs would be fairly gnarly.

The item that I find somewhat confusing is how to apply multi-path fibre devices to a stripe of mirrors. Consider these:

# mpathadm list lu
        /dev/rdsk/c4t20000004CF9B63D0d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t20000004CFA4D655d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t20000004CFA4D2D9d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t20000004CFBFD4BDd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t20000004CFA4D3A1d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t20000004CFA4D2C7d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /scsi_vhci/ses at g50800200001ad5d8
                Total Path Count: 2
                Operational Path Count: 2

Here we have each disk device sitting on two fibre loops:

# mpathadm show lu /dev/rdsk/c4t20000004CF9B63D0d0s2
Logical Unit:  /dev/rdsk/c4t20000004CF9B63D0d0s2
        mpath-support:  libmpscsi_vhci.so
        Vendor:  SEAGATE
        Product:  ST373405FSUN72G
        Revision:  0438
        Name Type:  unknown type
        Name:  20000004cf9b63d0
        Asymmetric:  no
        Current Load Balance:  round-robin
        Logical Unit Group ID:  NA
        Auto Failback:  on
        Auto Probing:  NA

        Paths:
                Initiator Port Name:  21000003ba2cabc6
                Target Port Name:  21000004cf9b63d0
                Override Path:  NA
                Path State:  OK
                Disabled:  no

                Initiator Port Name:  210100e08b24f056
                Target Port Name:  22000004cf9b63d0
                Override Path:  NA
                Path State:  OK
                Disabled:  no

        Target Ports:
                Name:  21000004cf9b63d0
                Relative ID:  0

                Name:  22000004cf9b63d0
                Relative ID:  0

This is not disk redundancy but rather fibre path redundancy. When I drop these guys into a ZPool it looks like this:

        NAME                         STATE     READ WRITE CKSUM
        fp0                          ONLINE       0     0     0
          mirror                     ONLINE       0     0     0
            c4t20000004CFBFD4BDd0s0  ONLINE       0     0     0
            c4t20000004CFA4D3A1d0s0  ONLINE       0     0     0
          mirror                     ONLINE       0     0     0
            c4t20000004CFA4D2D9d0s0  ONLINE       0     0     0
            c4t20000004CFA4D2C7d0s0  ONLINE       0     0     0
          mirror                     ONLINE       0     0     0
            c4t20000004CFA4D655d0s0  ONLINE       0     0     0
            c4t20000004CF9B63D0d0s0  ONLINE       0     0     0

So the manner in which any given IO transaction gets to the zfs filesystem just gets ever more complicated and convoluted, and it makes me wonder if I am tossing away performance to get higher levels of safety.

--
Dennis Clarke
dclarke at opensolaris.ca  <- Email related to the open source Solaris
dclarke at blastwave.org   <- Email related to open source for Solaris
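[One quick way to check whether those extra layers are costing anything is to watch per-vdev activity while the pool is busy; a simple sketch, with a 5-second interval:]

    zpool iostat -v fibre0 5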
On Thu, June 3, 2010 10:50, Marty Scholes wrote:
> David Dyer-Bennet wrote:
>> My choice of mirrors rather than RAIDZ is based on the fact that I have only 8 hot-swap bays (I still think of this as LARGE for a home server; the competition, things like the Drobo, tends to have 4 or 5), that I don't need really large amounts of storage (after my latest upgrade I'm running with 1.2TB of available data space), and that I expected to need to expand storage over the life of the system. With mirror vdevs, I can expand them without compromising redundancy even temporarily, by attaching the new drives before I detach the old drives; I couldn't do that with RAIDZ. Also, the fact that disk is now so cheap means that 100% redundancy is affordable, I don't have to compromise on RAIDZ.
>
> Maybe I have been unlucky too many times doing storage admin in the 90s, but simple mirroring still scares me. Even with a hot spare (you do have one, right?) the rebuild window leaves the entire pool exposed to a single failure.

No hot spare currently. And now running on 4-year-old disks, too. For me, mirroring is a big step UP from bare single drives. That's my "default state". Of course, I'm a big fan of multiple levels of backup.

> One of the nice things about zfs is that it allows, "to each his own." My home server's main pool is 22x 73GB disks in a Sun A5000 configured as RAIDZ3. Even without a hot spare, it takes several failures to get the pool into trouble.

Yes, it's very flexible, and while there are no doubt useless degenerate cases here and there, lots of the cases are useful for some environment or other. That does seem like rather an extreme configuration.

> At the same time, there are several downsides to a wide stripe like that, including relatively poor iops and longer rebuild windows. As noted above, until bp_rewrite arrives, I cannot change the geometry of a vdev, which kind of limits the flexibility.

There are a LOT of reasons to want bp_rewrite, certainly.

> As a side rant, I still find myself baffled that Oracle/Sun correctly touts the benefits of zfs in the enterprise, including tremendous flexibility and simplicity of filesystem provisioning and nondisruptive changes to filesystems via properties.
>
> These forums are filled with people stating that the enterprise demands simple, flexible and nondisruptive filesystem changes, but no enterprise cares about simple, flexible and nondisruptive pool/vdev changes, e.g. changing a vdev geometry or evacuating a vdev. I can't accept that zfs flexibility is critical and zpool flexibility is unwanted.

We could certainly use that level of pool-equivalent flexibility at work; we don't currently have it (not ZFS, not high-end enterprise storage units).

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Jun 3, 2010, at 8:36 AM, Freddie Cash wrote:
> On Wed, Jun 2, 2010 at 8:10 PM, Roman Naumenko <roman at naumenko.ca> wrote:
>> Well, I didn't explain it very clearly. I meant that the size of a raidz array can't be changed. Sure, zpool add can do the job for a pool - but not for a raidz configuration.
>
> You can't increase the number of drives in a raidz vdev, no. Going from a 4-drive raidz1 to a 5-drive raidz1 is currently impossible. And going from a raidz1 to a raidz2 vdev is currently impossible. On the flip side, it's rare to find a hardware RAID controller that allows this.

AFAIK, and someone please correct me, the only DIY/FOSS RAID implementation that allows incremental growing of RAID-5 is LVM. Of course, that means you're stuck with the RAID-5 write hole. TANSTAAFL.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
On Thu, June 3, 2010 10:50, Garrett D'Amore wrote:
> On Thu, 2010-06-03 at 10:35 -0500, David Dyer-Bennet wrote:
>> On Thu, June 3, 2010 10:15, Garrett D'Amore wrote:
>>> Using a stripe of mirrors (RAID0) you can get the benefits of multiple spindle performance, easy expansion support (just add new mirrors to the end of the raid0 stripe), and 100% data redundancy. If you can afford to pay double for your storage (the cost of mirroring), this is IMO the best solution.
>>
>> Referencing "RAID0" here in the context of ZFS is confusing, though. Are you suggesting using underlying RAID hardware to create virtual volumes to then present to ZFS, or what?
>
> RAID0 is basically the default configuration of a ZFS pool -- it's a concatenation of the underlying vdevs. In this case the vdevs should themselves be two-drive mirrors.
>
> This of course has to be done in the ZFS layer, and ZFS doesn't call it RAID0, any more than it calls a mirror RAID1, but effectively that's what they are.

Kinda mostly, anyway. I thought we recently had this discussion, and people were pointing out things like the striping wasn't physically the same on each drive and such.

>>> Note that this solution is not quite as resilient against hardware failure as raidz2 or raidz3. While the RAID1+0 solution can tolerate multiple drive failures, if both drives in a mirror fail, you lose data.
>>
>> In a RAIDZ solution, two or more drive failures lose your data. In a mirrored solution, losing the WRONG two drives will still lose your data, but you have some chance of surviving losing a random two drives. So I would describe the mirror solution as more resilient.
>>
>> So going to RAIDZ2 or even RAIDZ3 would be better, I agree.
>
> From a data resiliency point, yes, raidz2 or raidz3 offers better protection. At a significant performance cost.

The place I care about performance is almost entirely sequential read/write -- loading programs, and loading and saving large image files. I don't know a lot of home users that actually need high IOPS.

> Given enough drives, one could probably imagine using raidz3 underlying vdevs, with RAID0 striping to spread I/O across multiple spindles. I'm not sure how well this would perform, but I suspect it would perform better than straight raidz2/raidz3, but at a significant expense (you'd need a lot of drives).

Might well work that way; it does sound about right.

>> In an 8-bay chassis, there are other concerns, too. Do I keep space open for a hot spare? There's no real point in a hot spare if you have only one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2 plus a hot spare. And putting everything into one vdev means that for any upgrade I have to replace all 8 drives at once, a financial problem for a home server.
>
> This is one of the reasons I don't advocate using raidz (any version) for home use, unless you can't afford the cost in space represented by mirroring and a hot spare or two. (The other reason ... for my use at least... is the performance cost. I want to use my array to host compilation workspaces, and for that I would prefer to get the most performance out of my solution. I suppose I could add some SSDs... but I still think multiple spindles are a good option when you can do it.)
>
> In an 8 drive chassis, without any SSDs involved, I'd configure 6 of the drives as a 3 vdev stripe consisting of mirrors of 2 drives, and I'd leave the remaining two bays as hot spares. Btw, using the hot spares in this way potentially means you can use those bays later to upgrade to larger drives in the future, without offlining anything and without taking too much of a performance penalty when you do so.

And the three 2-way mirrors are exactly where I am right now. I don't have hot spares in place, but I have the bays reserved for that use.

In the latest upgrade, I added 4 2.5" hot-swap bays (which got the system disks out of the 3.5" hot-swap bays). I have two free, and that's the form-factor SSDs come in these days, so if I thought it would help I could add an SSD there. Have to do quite a bit of research to see which uses would actually benefit me, and how much. It's not obvious that either l2arc or zil on SSD would help my program loading, image file loading, or image file saving cases that much. There may be more other stuff than I really think of, though.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Thu, 3 Jun 2010, David Dyer-Bennet wrote:
> In an 8-bay chassis, there are other concerns, too. Do I keep space open for a hot spare? There's no real point in a hot spare if you have only one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2 plus a hot spare. And putting everything into one vdev means that for any upgrade I have to replace all 8 drives at once, a financial problem for a home server.

It is not so clear to me that an 8-drive raidz3 is clearly better than 7-drive raidz2 plus a hot spare. From a maintenance standpoint, I think that it is useful to have a spare drive or even an empty spare slot so that it is easy to replace a drive without needing to physically remove it from the system. A true hot spare allows replacement to start automatically right away if a failure is detected.

With only 8 drives, the reliability improvement from raidz3 is unlikely to be borne out in practice. Other potential failure modes will completely drown out the on-paper reliability improvement provided by raidz3.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
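[For reference, the two 8-bay layouts being compared, sketched with hypothetical device names:]

    # 8-drive raidz3, no spare
    zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0

    # 7-drive raidz2 plus a hot spare
    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 spare c0t7d0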
On Thu, 2010-06-03 at 12:03 -0500, Bob Friesenhahn wrote:
> On Thu, 3 Jun 2010, David Dyer-Bennet wrote:
>> In an 8-bay chassis, there are other concerns, too. Do I keep space open for a hot spare? There's no real point in a hot spare if you have only one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2 plus a hot spare. And putting everything into one vdev means that for any upgrade I have to replace all 8 drives at once, a financial problem for a home server.
>
> It is not so clear to me that an 8-drive raidz3 is clearly better than 7-drive raidz2 plus a hot spare. From a maintenance standpoint, I think that it is useful to have a spare drive or even an empty spare slot so that it is easy to replace a drive without needing to physically remove it from the system. A true hot spare allows replacement to start automatically right away if a failure is detected.
>
> With only 8 drives, the reliability improvement from raidz3 is unlikely to be borne out in practice. Other potential failure modes will completely drown out the on-paper reliability improvement provided by raidz3.

I tend to concur. I think that raidz3 is primarily useful in situations with either an extremely large number of drives (very large arrays), or in situations calling for extremely high fault tolerance (think loss-of-life kinds of applications, or wall-street trading house applications where downtime is measured in millions of dollars per minute.)

And in those situations where raidz3 is called for, I think you still want some pool of hot spares. (I'm thinking of the kinds of deployments where the failure rate of drives approaches the ability of the site to replace them quickly enough -- think very very large data centers with hundreds or even thousands of drives.)

raidz3 is not, I think, for the typical home user, or even the typical workgroup server application. I think I'd prefer raidz with hot spare(s) over raidz2, even, for a typical situation. But I view raidz in all its forms as a kind of compromise between redundancy, performance, and capacity -- sort of a jack of all trades and master of none.

With $/Gb as low as they are today, I would be hard pressed to recommend any of the raidz configurations except in applications calling for huge amounts of data with no real performance requirements (nearline backup kinds of applications) and no requirements for expandability. (Situations where expansion is resolved by purchasing new arrays, rather than growing storage within an array.)

   -- Garrett
On Thu, 2010-06-03 at 08:50 -0700, Marty Scholes wrote:
> Maybe I have been unlucky too many times doing storage admin in the 90s, but simple mirroring still scares me. Even with a hot spare (you do have one, right?) the rebuild window leaves the entire pool exposed to a single failure.
>
> One of the nice things about zfs is that it allows, "to each his own." My home server's main pool is 22x 73GB disks in a Sun A5000 configured as RAIDZ3. Even without a hot spare, it takes several failures to get the pool into trouble.

Perhaps you have been unlucky. Certainly, there is a window with N+1 redundancy where a single failure leaves the system exposed in the face of a 2nd fault. This is a statistics game...

Mirrors made up of multiple drives are of course substantially more risky than mirrors made of just drive pairs. I would strongly discourage multiple drive mirrors unless the devices underneath the mirror are somehow configured in a way that provides additional tolerance, such as a mirror of raidz devices. Although, such a configuration would be a poor choice, since you'd take a big performance penalty.

Of course, you can have more than a two-way mirror, at substantially increased cost. So you balance your needs.

RAIDZ2 and RAIDZ3 give N+2 and N+3 fault tolerance, and represent a compromise weighted to fault tolerance and capacity, at a significant penalty to performance (and, as noted, to the ability to increase capacity). There certainly are applications where this is appropriate. I doubt most home users fall into that category.

Given a relatively small number of spindles (the 8 that was quoted), I prefer RAID 1+0 with hot spares. If I can invest in 8 drives, with 1TB drives I can balance I/O across 3 spindles, get 3TB of storage, have N+1.x tolerance (N+1, plus the ability to take up to two more faults as long as they do not occur in the same pair of mirrored drives), and I can easily grow to larger drives (for example the forthcoming 3TB drives) when need and cost make that move appropriate.

   -- Garrett
On Thu, 2010-06-03 at 12:22 -0400, Dennis Clarke wrote:
>> If you're clever, you'll also try to make sure each side of the mirror is on a different controller, and if you have enough controllers available, you'll also try to balance the controllers across stripes.
>
> Something like this?
>
> # zpool status fibre0
>   pool: fibre0
>  state: ONLINE
> status: The pool is formatted using an older on-disk format. The pool can
>         still be used, but some features are unavailable.
> action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
>         pool will no longer be accessible on older software versions.
>  scrub: none requested
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         fibre0       ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c2t16d0  ONLINE       0     0     0
>             c5t0d0   ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c5t1d0   ONLINE       0     0     0
>             c2t17d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c5t2d0   ONLINE       0     0     0
>             c2t18d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c2t20d0  ONLINE       0     0     0
>             c5t4d0   ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c2t21d0  ONLINE       0     0     0
>             c5t6d0   ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c2t19d0  ONLINE       0     0     0
>             c5t5d0   ONLINE       0     0     0
>         spares
>           c2t22d0    AVAIL
>
> errors: No known data errors

That looks like a good configuration to me!

> However, unlike the bad old days of SVM (DiskSuite or Solstice DiskSuite or Online DiskSuite, etc.) I have no idea what algorithm is used to pick the hot spare in the event of a failure. I mean, if I had more than one hot spare there, of course. Also, I think the weird order of controllers is a user mistake on my part. Some of them have c5 listed first and others have c2 listed first. I don't know if that matters at all, however.

I don't think the order matters. It certainly won't make a difference for write, since you have to use both sides of the mirror. It *could* make a difference for reading... but I suspect that zfs will try to sufficiently balance things out that any difference in ordering will be lost in the noise.

The hot spare replacement shouldn't matter all that much... except when you're using them to upgrade to bigger drives. (Then just use a single hot spare to force the selection.)

> I can add mirrors on the fly but I can not (yet) remove them. I would imagine that the algorithm to remove data from vdevs would be fairly gnarly.

Indeed. However, you shouldn't need to remove vdevs. With redundancy, you can resilver to a hot spare, and then remove the drive, but ultimately, you can't condense the data onto fewer drives. You have to accept when you configure your array that, modulo hot spares, your pool will always consume the same number of spindles.

> The item that I find somewhat confusing is how to apply multi-path fibre devices to a stripe of mirrors. Consider these:
>
> This is not disk redundancy but rather fibre path redundancy. When I drop these guys into a ZPool it looks like this:
>
>         NAME                         STATE     READ WRITE CKSUM
>         fp0                          ONLINE       0     0     0
>           mirror                     ONLINE       0     0     0
>             c4t20000004CFBFD4BDd0s0  ONLINE       0     0     0
>             c4t20000004CFA4D3A1d0s0  ONLINE       0     0     0
>           mirror                     ONLINE       0     0     0
>             c4t20000004CFA4D2D9d0s0  ONLINE       0     0     0
>             c4t20000004CFA4D2C7d0s0  ONLINE       0     0     0
>           mirror                     ONLINE       0     0     0
>             c4t20000004CFA4D655d0s0  ONLINE       0     0     0
>             c4t20000004CF9B63D0d0s0  ONLINE       0     0     0

The above configuration looks good to me.

> So the manner in which any given IO transaction gets to the zfs filesystem just gets ever more complicated and convoluted, and it makes me wonder if I am tossing away performance to get higher levels of safety.

If you're using multipathing, then you get path load balancing automatically, and you can pretty much ignore the controller balancing issue, as long as you use the mpxio (scsi_vhci) path. mpxio should take care of ensuring that I/O is balanced across ports for you; you just need to make sure that you are balancing *spindles* properly.

   -- Garrett
On Thu, 2010-06-03 at 11:49 -0500, David Dyer-Bennet wrote:
> I don't have hot spares in place, but I have the bays reserved for that use.
>
> In the latest upgrade, I added 4 2.5" hot-swap bays (which got the system disks out of the 3.5" hot-swap bays). I have two free, and that's the form-factor SSDs come in these days, so if I thought it would help I could add an SSD there. Have to do quite a bit of research to see which uses would actually benefit me, and how much. It's not obvious that either l2arc or zil on SSD would help my program loading, image file loading, or image file saving cases that much. There may be more other stuff than I really think of, though.

It really depends on the working sets these programs deal with.

zil is useful primarily when doing lots of writes, especially lots of writes to small files or to data scattered throughout a file. I view it as a great solution for database acceleration, and for accelerating the filesystems I use for hosting compilation workspaces. (In retrospect, since by definition the results of compilation are reproducible, maybe I should just turn off synchronous writes for build workspaces... provided that they do not contain any modifications to the sources themselves. I'm going to have to play with this.)

l2arc is useful for data that is read back frequently but is too large to fit in buffer cache. I can imagine that it would be useful for hosting storage associated with lots of programs that are called frequently. You can think of it as a logical extension of the buffer cache in this regard... if your working set doesn't fit in RAM, then l2arc can prevent going back to rotating media.

All other things being equal, I'd increase RAM before I'd worry too much about l2arc. The exception to that would be if I knew I had working sets that couldn't possibly fit in RAM... 160GB of SSD is a *lot* cheaper than 160GB of RAM. :-)

  - Garrett
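[A sketch of how those two SSD roles get attached to an existing pool, assuming a pool named tank and hypothetical SSD device names:]

    zpool add tank log c5t0d0     # dedicated ZIL (slog) device for synchronous writes
    zpool add tank cache c5t1d0   # L2ARC device to extend the read cache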
On Thu, June 3, 2010 13:04, Garrett D''Amore wrote:> On Thu, 2010-06-03 at 11:49 -0500, David Dyer-Bennet wrote: >> hot spares in place, but I have the bays reserved for that use. >> >> In the latest upgrade, I added 4 2.5" hot-swap bays (which got the >> system >> disks out of the 3.5" hot-swap bays). I have two free, and that''s the >> form-factor SSDs come in these days, so if I thought it would help I >> could >> add an SSD there. Have to do quite a bit of research to see which uses >> would actually benefit me, and how much. It''s not obvious that either >> l2arc or zil on SSD would help my program loading, image file loading, >> or >> image file saving cases that much. There may be more other stuff than I >> really think of though. > > It really depends on the working sets these programs deal with. > > zil is useful primarily when doing lots of writes, especially lots of > writes to small files or to data scattered throughout a file. I view it > as a great solution for database acceleration, and for accelerating the > filesystems I use for hosting compilation workspaces. (In retrospect, > since by definition the results of compilation are reproducible, maybe I > should just turn off synchronous writes for build workspaces... provided > that they do not contain any modifications to the sources themselves. > I''m going to have to play with this.)I suspect there are more cases here than I immediately think of. For example, sitting here thinking, I wonder if the web cache would benefit a lot? And all those email files? RAW files from my camera are 12-15MB, and the resulting Photoshop files are around 50MB (depending on compression, and they get bigger fast if I add layers). Those aren''t small, and I don''t read the same thing over and over lots. For build spaces, definitely should be reproducible from source. A classic production build starts with checking out a tagged version from source control, and builds from there.> l2arc is useful for data that is read back frequently but is too large > to fit in buffer cache. I can imagine that it would be useful for > hosting storage associated with lots of programs that are called > frequently. You can think of it as a logical extension of the buffer > cache in this regard... if your working set doesn''t fit in RAM, then > l2arc can prevent going back to rotating media.I don''t think I''m going to benefit much from this.> All other things being equal, I''d increase RAM before I''d worry too much > about l2arc. The exception to that would be if I knew I had working > sets that couldn''t possibly fit in RAM... 160GB of SSD is a *lot* > cheaper than 160GB of RAM. :-)I just did increase RAM, same upgrade as the 2.5" bays and the additional controller and the third mirrored vdev. I increased it all the way to 4GB! And I can''t increase it further feasibly (4GB sticks of ECC RAM being hard to find and extremely pricey; plus I''d have to displace some of my existing memory). Since this is a 2006 system, in another couple of years it''ll be time to replace MB and processor and memory, and I''m sure it''ll have a lot more memory next time. I''m desperately waiting for Solaris 2006.$Q2 ("Q2" since it was pointed out last time that "Spring" was wrong on half the Earth), since I hope it will resolve my backup problems so I can get incremental backups happening nightly (intention is to use zfs send/receive with incremental replication streams, to keep external drives up-to-date with data and all snapshots). 
The age of the system, and especially of the drives, makes this more urgent, though of course it's important in general. I do get a full backup to complete now and then, and backups will complete overnight if they don't hang. The problem is that when they hang, I have to reboot the Solaris box and every Windows box using it.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
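A rough sketch of the incremental send/receive replication described above, with the pool, snapshot, and backup-pool names made up for illustration:

    # one-time full replication of the pool to an external pool "backup"
    zfs snapshot -r tank@backup-20100603
    zfs send -R tank@backup-20100603 | zfs receive -F -d backup

    # nightly incremental: everything between the last common snapshot
    # and tonight's, including intermediate snapshots and properties
    zfs snapshot -r tank@backup-20100604
    zfs send -R -I tank@backup-20100603 tank@backup-20100604 | zfs receive -F -d backup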
On Thu, June 3, 2010 12:03, Bob Friesenhahn wrote:
> On Thu, 3 Jun 2010, David Dyer-Bennet wrote:
>>
>> In an 8-bay chassis, there are other concerns, too. Do I keep space open
>> for a hot spare? There's no real point in a hot spare if you have only
>> one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2
>> plus a hot spare. And putting everything into one vdev means that for any
>> upgrade I have to replace all 8 drives at once, a financial problem for a
>> home server.
>
> It is not so clear to me that an 8-drive raidz3 is clearly better than
> 7-drive raidz2 plus a hot spare. From a maintenance standpoint, I
> think that it is useful to have a spare drive or even an empty spare
> slot so that it is easy to replace a drive without needing to
> physically remove it from the system. A true hot spare allows
> replacement to start automatically right away if a failure is
> detected.

But is having a RAIDZ2 drop to single redundancy, with replacement starting instantly, actually as good or better than having a RAIDZ3 drop to double redundancy, with actual replacement happening later? The "degraded" state of the RAIDZ3 has the same redundancy as the "healthy" state of the RAIDZ2.

Certainly having a spare drive bay to play with is often helpful, though the scenarios that most immediately spring to mind are all mirror-related and hence don't apply here.

> With only 8 drives, the reliability improvement from raidz3 is
> unlikely to be borne out in practice. Other potential failure modes
> will completely drown out the on-paper reliability improvement
> provided by raidz3.

I wouldn't give up much of anything to add Z3 on 8 drives, no.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
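For reference, the two layouts being compared would be created roughly like this (pool and disk names are made up):

    # 8-drive raidz3: triple parity, no hot spare
    zpool create tank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0

    # 7-drive raidz2 plus a hot spare in the eighth bay
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 spare c1t7d0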
On 6/2/10 11:10 PM -0400 Roman Naumenko wrote:
> Well, I explained it not very clearly. I meant the size of a raidz array
> can't be changed. For sure zpool add can do the job with a pool. Not with
> a raidz configuration.

Well in that case it's invalid to compare against Netapp since they can't do it either (seems to be the consensus on this list). Neither zfs nor Netapp (nor any product) is really designed to handle adding one drive at a time. Normally you have to add an entire shelf, and if you're doing that it's better to add a new vdev to your pool.
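Adding a whole new top-level vdev to an existing pool, as suggested here, is a single command. A minimal sketch with made-up pool and disk names:

    # grow the pool by a second raidz2 vdev (e.g. a new shelf of disks);
    # existing data stays where it is, and new writes favor the empty vdev
    zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0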
On 6/3/10 8:45 AM +0200 Juergen Nickelsen wrote:
> Richard Elling <richard at nexenta.com> writes:
>
>>> And some time before I had suggested to a my buddy zfs for his new
>>> home storage server, but he turned it down since there is no
>>> expansion available for a pool.
>>
>> Heck, let him buy a NetApp :-)
>
> Definitely a possibility, given the availability and pricing of
> oldish NetApp hardware on eBay.

Not really. The software license is invalid on resale, and you can't replace a failed drive with a generic drive, so at some point you must buy an Ontap license = $$$.
On 6/3/10 12:06 AM -0400 Roman Naumenko wrote:
> I think there is a difference. Just quickly checked netapp site:
>
> Adding new disks to a RAID group If a volume has more than one RAID
> group, you can specify the RAID group to which you are adding disks.

hmm that's a surprising feature to me.

I remember, and this was a few years back but I don't see why it would be any different now, we were trying to add drives 1-2 at a time to medium-sized arrays (don't buy the disks until we need them, to hold onto cash), and the Netapp performance kept going down down down. We eventually had to borrow an array from Netapp to copy our data onto to rebalance. Netapp told us explicitly, make sure to add an entire shelf at a time (and a new raid group, obviously, don't extend any existing group).
frank+lists/zfs at linetwo.net said:
> Well in that case it's invalid to compare against Netapp since they can't do
> it either (seems to be the consensus on this list). Neither zfs nor Netapp
> (nor any product) is really designed to handle adding one drive at a time.
> Normally you have to add an entire shelf, and if you're doing that it's
> better to add a new vdev to your pool.

This is incorrect (and another poster has pointed this out). NetApp can add a single drive (or more) to a raid-group, and has been able to do so since before they had dual-parity, aggregates, flex-vols, and rebalancing.

BTW, the rebalance after growing an aggregate is not automatic (as of OnTAP-7.3 anyway). You invoke a command manually on each volume that you care about, and the rebalance runs in the background until finished.

Regards,
Marion
On Thu, Jun 03, 2010 at 12:40:34PM -0700, Frank Cusack wrote:
> On 6/3/10 12:06 AM -0400 Roman Naumenko wrote:
>> I think there is a difference. Just quickly checked netapp site:
>>
>> Adding new disks to a RAID group If a volume has more than one RAID
>> group, you can specify the RAID group to which you are adding disks.
>
> hmm that's a surprising feature to me.

It's always been possible with Netapp. Back in the pre-5.0 (maybe it was pre-4.0) days, an OnTAP device only had one raid group and one filesystem/volume. All you could do was expand it, not add additional raid groups or additional volumes. When the other features were added, the ability to expand a raid group was not removed.

> I remember, and this was a few years back but I don't see why it would
> be any different now, we were trying to add drives 1-2 at a time to
> medium-sized arrays (don't buy the disks until we need them, to hold
> onto cash), and the Netapp performance kept going down down down. We
> eventually had to borrow an array from Netapp to copy our data onto
> to rebalance. Netapp told us explicitly, make sure to add an entire
> shelf at a time (and a new raid group, obviously, don't extend any
> existing group).

Yup, that's absolutely the best way to do it. Otherwise, all your writes will be on the one or two new disks, creating hotspots until you can rebalance your data, and that could take a long time. I'm pretty sure in the distant past they had no online rebalancer. Nowadays there is one, but it's not particularly speedy.

Think adding a mirror pair to a large, nearly-full zpool. The same thing will happen.

--
Darren
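On the zfs side the same imbalance is easy to observe after adding a vdev to a nearly-full pool; a quick check, with a made-up pool name:

    # per-vdev capacity and I/O statistics, refreshed every 5 seconds;
    # most write activity will show up on the new, mostly-empty vdev
    zpool iostat -v tank 5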
frank+lists/zfs at linetwo.net said:
> I remember, and this was a few years back but I don't see why it would be any
> different now, we were trying to add drives 1-2 at a time to medium-sized
> arrays (don't buy the disks until we need them, to hold onto cash), and the
> Netapp performance kept going down down down. We eventually had to borrow an
> array from Netapp to copy our data onto to rebalance. Netapp told us
> explicitly, make sure to add an entire shelf at a time (and a new raid group,
> obviously, don't extend any existing group).

The advent of aggregates fixed that problem. It used to be that a raid-group belonged to only one volume. Now multiple flex-vols (even tiny ones) share all the spindles (and parity drives) on their aggregate, and you can rebalance after adding drives without having to manually move/copy existing data. Pretty slick, if you can afford the price.

Regards,
Marion
On Jun 3, 2010, at 3:16 AM, Erik Trimble wrote:
> Expanding a RAIDZ (which, really, is the only thing zfs can't do right now,
> w/r/t adding disks) requires the Block Pointer (BP) Rewrite functionality
> before it can get implemented.

Strictly speaking, BP rewrite is not required to expand a RAID-Z, though it is required to be able to rewrite all blocks on the expanded VDEV so that the newly attached space is usable.

regards
victor
On Thu, 3 Jun 2010, David Dyer-Bennet wrote:
>
> But is having a RAIDZ2 drop to single redundancy, with replacement
> starting instantly, actually as good or better than having a RAIDZ3 drop
> to double redundancy, with actual replacement happening later? The
> "degraded" state of the RAIDZ3 has the same redundancy as the "healthy"
> state of the RAIDZ2.

Mathematically, I am sure that raidz3 is better. Redundancy statistics are not the only consideration though. Raidz3 will write slower and resilver slower. If the power supply produces a surge and fries all the drives, then raidz3 will not help more than raidz2.

Once the probability of failure due to unrelated drive failures becomes small enough, other factors related to the system become the dominant ones. The power supply could surge, memory can return wrong data (even with ECC), the OS kernel can have a bug, or a tree can fall on the computer during a storm.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Jun 3, 2010, at 13:36, Garrett D'Amore wrote:
> Perhaps you have been unlucky. Certainly, there is a window with N+1
> redundancy where a single failure leaves the system exposed in the face
> of a 2nd fault. This is a statistics game...

It doesn't even have to be a drive failure, but an unrecoverable read error.
On Jun 3, 2010 7:35 PM, David Magda wrote:
> On Jun 3, 2010, at 13:36, Garrett D'Amore wrote:
>
>> Perhaps you have been unlucky. Certainly, there is a window with N+1
>> redundancy where a single failure leaves the system exposed in the face
>> of a 2nd fault. This is a statistics game...
>
> It doesn't even have to be a drive failure, but an unrecoverable read
> error.

Well said. Also include a controller burp, a bit flip somewhere, a drive going offline briefly, a momentary fibre cable interruption, etc. The list goes on.

My experience is that these weirdo "once in a lifetime" issues tend to present in clumps which are not as evenly distributed as statistics would lead you to believe. Rather, like my kids, they save up their fun into coordinated bursts. When these bursts happen, the ensuing conversations with stakeholders, about how all of this "redundancy" you tricked them into purchasing has left them exposed, are not good times.
--
This message posted from opensolaris.org
In addition to all the comments below, the 7000 series, which competes with NetApp boxes, can add more storage to the pool in a couple of seconds, online, and does the load balancing automatically. Also, we don't have the 16 TB limit NetApp has. Nearly all customers did this without any PS involvement.

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Richard Elling
Sent: Thursday, June 03, 2010 3:51 AM
To: Roman Naumenko
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] one more time: pool size changes

On Jun 2, 2010, at 3:54 PM, Roman Naumenko wrote:
> Recently I talked to a co-worker who manages NetApp storages. We discussed
> size changes for pools in zfs and aggregates in NetApp.
>
> And some time before I had suggested to a my buddy zfs for his new home
> storage server, but he turned it down since there is no expansion available
> for a pool.

Heck, let him buy a NetApp :-)

> And he really wants to be able to add a drive or couple to an existing
> pool. Yes, there are ways to expand storage to some extent without
> rebuilding it. Like replacing disk with larger ones. Not enough for a
> typical home user I would say.

Why not? I do this quite often. Growing is easy, shrinking is more challenging.

> And this is might be an important for corporate too. Frankly speaking I
> doubt there are many administrators use it in DC environment.
>
> Nevertheless, NetApp appears to have such feature as I learned from my
> co-worker. It works with some restrictions (you have to zero disks before
> adding, and rebalance the aggregate after and still without perfect
> distribution) - but Ontap is able to do aggregates expansion nevertheless.
>
> So, my question is: what does prevent to introduce the same for zfs at
> present time? Is this because of the design of zfs, or there is simply no
> demand for it in community?

It's been there since 2005: zpool subcommand add.
 -- richard

> My understanding is that at present time there are no plans to introduce it.
>
> --Regards,
> Roman Naumenko
> roman at naumenko.com

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/