Hi. I'm looking for the best solution to create an expandable, heterogeneous pool of drives. In an ideal world there would be a RAID variant that could cleverly handle both multiple drive sizes and the addition of new drives to a group (so one could drop in a new drive of arbitrary size, maintain some redundancy, and gain most of that drive's capacity), but my impression is that we're far from there.

Absent that, I was considering using ZFS and just having a single pool. My main question is this: what is the failure mode of ZFS if one of those drives either fails completely or has errors? Do I permanently lose access to the entire pool? Can I attempt to read other data? Can I "zfs replace" the bad drive and get some level of data recovery? Otherwise, by pooling drives am I simply increasing the probability of a catastrophic data loss? I apologize if this is addressed elsewhere -- I've read a bunch about ZFS but have not come across this particular answer.

As a side question, does anyone have a suggestion for an intelligent way to approach this goal? This is not mission-critical data, but I'd prefer not to make data loss _more_ probable. Perhaps some volume manager (like LVM on Linux) has appropriate features?

Thanks for any help.

-puk
Jef Pearlman wrote:
> Hi. I'm looking for the best solution to create an expandable, heterogeneous pool of drives. In an ideal world there would be a RAID variant that could cleverly handle both multiple drive sizes and the addition of new drives to a group (so one could drop in a new drive of arbitrary size, maintain some redundancy, and gain most of that drive's capacity), but my impression is that we're far from there.

Mirroring (aka RAID-1, though technically more like RAID-1+0) in ZFS will do this.

> Absent that, I was considering using ZFS and just having a single pool. My main question is this: what is the failure mode of ZFS if one of those drives either fails completely or has errors? Do I permanently lose access to the entire pool? Can I attempt to read other data? Can I "zfs replace" the bad drive and get some level of data recovery? Otherwise, by pooling drives am I simply increasing the probability of a catastrophic data loss? I apologize if this is addressed elsewhere -- I've read a bunch about ZFS but have not come across this particular answer.

We generally recommend a single pool, as long as the use case permits. But I think you are confused about what a zpool is. I suggest you look at the examples or docs. A good overview is the slide show
http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf

> As a side question, does anyone have a suggestion for an intelligent way to approach this goal? This is not mission-critical data, but I'd prefer not to make data loss _more_ probable. Perhaps some volume manager (like LVM on Linux) has appropriate features?

A ZFS mirrored pool will be the most performant and easiest to manage, with better RAS than a raidz pool.
 -- richard
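To make the mirror-based approach Richard describes concrete, here is a minimal sketch; the device names are hypothetical examples only:

    # create a pool from one mirrored pair
    zpool create tank mirror c0t0d0 c0t1d0

    # later, grow the pool by adding another mirrored pair of whatever
    # size you have on hand; ZFS stripes new data across both mirrors
    zpool add tank mirror c1t0d0 c1t1d0

Each pair contributes the capacity of its smaller member, and any single-disk failure leaves the pool intact.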
> Jef Pearlman wrote:
> > Absent that, I was considering using zfs and just having a single pool. My main question is this: what is the failure mode of zfs if one of those drives either fails completely or has errors? Do I permanently lose access to the entire pool? Can I attempt to read other data? Can I "zfs replace" the bad drive and get some level of data recovery? Otherwise, by pooling drives am I simply increasing the probability of a catastrophic data loss? I apologize if this is addressed elsewhere -- I've read a bunch about zfs, but not come across this particular answer.
>
> We generally recommend a single pool, as long as the use case permits. But I think you are confused about what a zpool is. I suggest you look at the examples or docs. A good overview is the slide show
> http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf

Perhaps I'm not asking my question clearly. I've already experimented a fair amount with ZFS, including creating and destroying a number of pools with and without redundancy, replacing vdevs, etc. Maybe asking by example will clarify what I'm looking for or where I've missed the boat. The key is that I want a grow-as-you-go heterogeneous set of disks in my pool:

Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). At some later point, I run across an unused 30g drive, which I add to the pool. Now my pool is 130g. At some point after that, the 40g drive fails, either by producing read errors or by failing to spin up at all. What happens to my pool? Can I mount and access it at all (for the data not on or striped across the 40g drive)? Can I "zfs replace" the 40g drive with another drive and have it attempt to copy as much data over as it can? Or am I just out of luck? ZFS seems like a great way to use old/unutilized drives to expand capacity, but sooner or later one of those drives will fail, and if it takes out the whole pool (which it might reasonably do), then it doesn't work out in the end.

> > As a side question, does anyone have a suggestion for an intelligent way to approach this goal? This is not mission-critical data, but I'd prefer not to make data loss _more_ probable. Perhaps some volume manager (like LVM on Linux) has appropriate features?
>
> A ZFS mirrored pool will be the most performant and easiest to manage, with better RAS than a raidz pool.

The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).

Thanks for your help.

-Jef
> Perhaps I'm not asking my question clearly. I've already experimented a fair amount with ZFS, including creating and destroying a number of pools with and without redundancy, replacing vdevs, etc. Maybe asking by example will clarify what I'm looking for or where I've missed the boat. The key is that I want a grow-as-you-go heterogeneous set of disks in my pool:
>
> Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). At some later point, I run across an unused 30g drive, which I add to the pool. Now my pool is 130g. At some point after that, the 40g drive fails, either by producing read errors or by failing to spin up at all. What happens to my pool?

Since you have created a non-redundant pool (or, more specifically, a pool with non-redundant members), the pool will fail.

> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).

You can't add to an existing mirror, but you can add new mirror (or raidz) vdevs to the pool. If so, there's no loss of redundancy.

--
Darren Dunham                                   ddunham at taos.com
Senior Technical Consultant    TAOS             http://www.taos.com/
Got some Dr Pepper?                             San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
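As a sketch with made-up device names, the same idea applies to raidz: you cannot widen an existing raidz vdev, but you can add another one alongside it:

    # create a pool from a three-disk raidz vdev
    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0

    # grow it later by adding another raidz vdev; the pool keeps
    # redundancy because each top-level vdev is itself redundant
    zpool add tank raidz c2t0d0 c2t1d0 c2t2d0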
Darren Dunham wrote:
>> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).
>
> You can't add to an existing mirror, but you can add new mirror (or raidz) vdevs to the pool. If so, there's no loss of redundancy.

Maybe I'm missing some context, but you can add to an existing mirror - see zpool attach.

Neil.
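A quick sketch of what zpool attach does, with hypothetical device names:

    # attach c1t2d0 to the vdev containing c1t0d0: a single disk becomes
    # a two-way mirror, or a two-way mirror becomes a three-way mirror;
    # existing data is resilvered onto the new disk
    zpool attach tank c1t0d0 c1t2d0

This adds redundancy to that vdev but, as the follow-ups note, does not add capacity.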
> Darren Dunham wrote:
> >> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).
> >
> > You can't add to an existing mirror, but you can add new mirror (or raidz) vdevs to the pool. If so, there's no loss of redundancy.
>
> Maybe I'm missing some context, but you can add to an existing mirror - see zpool attach.

It depends on what you mean by "add". :-)

The original message was about increasing storage allocation. You can add redundancy to an existing mirror with attach, but you cannot increase the allocatable storage.

--
Darren Dunham                                   ddunham at taos.com
Senior Technical Consultant    TAOS             http://www.taos.com/
Got some Dr Pepper?                             San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
On Wed, 2007-06-27 at 14:50 -0700, Darren Dunham wrote:
> > Darren Dunham wrote:
> > >> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).
> > >
> > > You can't add to an existing mirror, but you can add new mirror (or raidz) vdevs to the pool. If so, there's no loss of redundancy.
> >
> > Maybe I'm missing some context, but you can add to an existing mirror - see zpool attach.
>
> It depends on what you mean by "add". :-)
>
> The original message was about increasing storage allocation. You can add redundancy to an existing mirror with attach, but you cannot increase the allocatable storage.

With mirrors, there is currently more flexibility than with raid-Z[2]. You can increase the allocatable storage size by replacing each disk in the mirror with a larger one (assuming you wait for a resync ;-P )

Thus, the _safe_ way to increase a mirrored vdev's size is:

Disk A: 100GB
Disk B: 100GB
Disk C: 250GB
Disk D: 250GB

zpool create tank mirror A B
(yank out A, put in C)
(wait for resync)
(yank out B, put in D)
(wait for resync)

and voila! tank goes from 100GB to 250GB of space.

I believe this should also work if LUNs are used instead of actual disks - but I don't believe that resizing a LUN currently in a mirror will work (please, correct me on this), so for a SAN-backed ZFS mirror it would be:

Assuming A = B < C, and after resizing A, A = C > B:

zpool create tank mirror A B
zpool attach tank A C      (where C is a new LUN of the new size desired)
(wait for sync of C)
zpool detach tank A
(unmap LUN A from the host, resize A to be the same as C, then map it back)
zpool attach tank C A
(wait for sync of A)
zpool detach tank B

I believe that will now result in a mirror of the full size of C, not of B.

I'd be interested to know if you could do this:

zpool create tank mirror A B
(resize LUN A and B to the new size)

without requiring a system reboot after resizing A & B (that is, the reboot would be needed to update the new LUN size on the host).

--
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
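For the physical-disk case, a rough sketch of the same growth using zpool replace (which, as noted later in the thread, behaves like attach followed by detach); disk names as in the example above:

    zpool create tank mirror A B
    zpool replace tank A C     # resilver A's data onto the larger disk C
    # wait for the resilver to finish (check with: zpool status tank)
    zpool replace tank B D     # then resilver B's data onto the larger disk D
    # once both resilvers complete, the mirror can use the smaller of C and D;
    # depending on the release, the extra space may only show up after the
    # pool is exported and re-imported or the system is rebooted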
Jef Pearlman wrote:
> Perhaps I'm not asking my question clearly. I've already experimented a fair amount with zfs, including creating and destroying a number of pools with and without redundancy, replacing vdevs, etc. Maybe asking by example will clarify what I'm looking for or where I've missed the boat. The key is that I want a grow-as-you-go heterogeneous set of disks in my pool:

The short answer:

zpool add -- add a top-level vdev as a dynamic stripe column
  + available space is increased

zpool attach -- add a mirror to an existing vdev
  + only works when the new mirror is the same size or larger than the existing vdev
  + available space is unchanged
  + redundancy (RAS) is increased

zpool detach -- remove a mirror from an existing vdev
  + available space increases if the removed mirror is smaller than the vdev
  + redundancy (RAS) is decreased

zpool replace -- functionally equivalent to attach followed by detach

> Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). At some later point, I run across an unused 30g drive, which I add to the pool. Now my pool is 130g. At some point after that, the 40g drive fails, either by producing read errors or by failing to spin up at all. What happens to my pool? Can I mount and access it at all (for the data not on or striped across the 40g drive)? Can I "zfs replace" the 40g drive with another drive and have it attempt to copy as much data over as it can? Or am I just out of luck? zfs seems like a great way to use old/unutilized drives to expand capacity, but sooner or later one of those drives will fail, and if it takes out the whole pool (which it might reasonably do), then it doesn't work out in the end.

For non-redundant zpools, a device failure *may* cause the zpool to be unavailable. The actual availability depends on the nature of the failure. A more common scenario might be to add a 400 GByte drive, which you can use to replace the older drives, or keep online for redundancy.

The zfs copies feature is a little bit harder to grok. It is difficult to predict how the system will be affected if you have copies=2 in your above scenario, because it depends on how the space is allocated. For more info, see my notes at:
http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection
 -- richard
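A small sketch of the copies property Richard mentions; the dataset name is just an example:

    # ask ZFS to keep two copies of every block written to this filesystem;
    # it only affects data written after the property is set, and it guards
    # against bad sectors rather than the loss of a whole non-redundant disk
    zfs set copies=2 tank/data
    zfs get copies tank/data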
On Wed, 2007-06-27 at 12:03 -0700, Jef Pearlman wrote:
> > Jef Pearlman wrote:
> > > Absent that, I was considering using zfs and just having a single pool. My main question is this: what is the failure mode of zfs if one of those drives either fails completely or has errors? Do I permanently lose access to the entire pool? Can I attempt to read other data? Can I "zfs replace" the bad drive and get some level of data recovery? Otherwise, by pooling drives am I simply increasing the probability of a catastrophic data loss? I apologize if this is addressed elsewhere -- I've read a bunch about zfs, but not come across this particular answer.

Pooling devices in a non-redundant mode (i.e., without a raidz or mirror vdev) increases your chance of losing data, just like every other RAID system out there. However, since ZFS doesn't do concatenation (it stripes), losing one drive in a non-redundant stripe effectively corrupts the entire dataset, as virtually all files should have some portion of their data on the dead drive.

> > We generally recommend a single pool, as long as the use case permits. But I think you are confused about what a zpool is. I suggest you look at the examples or docs. A good overview is the slide show
> > http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf
>
> Perhaps I'm not asking my question clearly. I've already experimented a fair amount with zfs, including creating and destroying a number of pools with and without redundancy, replacing vdevs, etc. Maybe asking by example will clarify what I'm looking for or where I've missed the boat. The key is that I want a grow-as-you-go heterogeneous set of disks in my pool:
>
> Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). At some later point, I run across an unused 30g drive, which I add to the pool. Now my pool is 130g. At some point after that, the 40g drive fails, either by producing read errors or by failing to spin up at all. What happens to my pool? Can I mount and access it at all (for the data not on or striped across the 40g drive)? Can I "zfs replace" the 40g drive with another drive and have it attempt to copy as much data over as it can? Or am I just out of luck? zfs seems like a great way to use old/unutilized drives to expand capacity, but sooner or later one of those drives will fail, and if it takes out the whole pool (which it might reasonably do), then it doesn't work out in the end.

Nope. Your zpool is a stripe. As mentioned above, losing one disk in a stripe effectively destroys all data, just as with any other RAID system.

> > > As a side question, does anyone have a suggestion for an intelligent way to approach this goal? This is not mission-critical data, but I'd prefer not to make data loss _more_ probable. Perhaps some volume manager (like LVM on Linux) has appropriate features?
> >
> > A ZFS mirrored pool will be the most performant and easiest to manage, with better RAS than a raidz pool.
>
> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).
>
> Thanks for your help.
> -Jef

To answer the original question, you _have_ to create mirrors, which, if you have odd-sized disks, will end up with unused space. An example:

Disk A: 20GB
Disk B: 30GB
Disk C: 40GB
Disk D: 60GB

Start with disks A & B:

zpool create tank mirror A B

This results in a 20GB pool. Later, add disks C & D:

zpool add tank mirror C D

This results in a 2-wide stripe of 2 mirrors, which means a total capacity of 60GB (20GB for A & B, 40GB for C & D). 10GB of the 30GB drive and 20GB of the 60GB drive are currently unused. You can lose one drive from both pairs (i.e. A and C, A and D, B and C, or B and D) before any data loss.

If you had known about the drive sizes beforehand, then you could have done something like this. Partition the drives as follows:

A: 1 20GB partition
B: 1 20GB & 1 10GB partition
C: 1 40GB partition
D: 1 40GB partition & 2 10GB partitions

then you do:

zpool create tank mirror Ap0 Bp0 mirror Cp0 Dp0 mirror Bp1 Dp1

and you get a total of 70GB of space. However, the performance of this is going to be bad (as you frequently need to write to both partitions on B & D, causing head seek), though you can still lose up to 2 drives before experiencing data loss.

--
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Erik Trimble wrote:
> If you had known about the drive sizes beforehand, then you could have done something like this. Partition the drives as follows:
>
> A: 1 20GB partition
> B: 1 20GB & 1 10GB partition
> C: 1 40GB partition
> D: 1 40GB partition & 2 10GB partitions
>
> then you do:
>
> zpool create tank mirror Ap0 Bp0 mirror Cp0 Dp0 mirror Bp1 Dp1
>
> and you get a total of 70GB of space. However, the performance of this is going to be bad (as you frequently need to write to both partitions on B & D, causing head seek), though you can still lose up to 2 drives before experiencing data loss.

It is not clear to me that we can say performance will be bad for stripes on single disks. The reason is that ZFS dynamic striping does not use a fixed interleave. In other words, if I write a block of N bytes to an M-way dynamic stripe, it is not guaranteed that each device will get an I/O of N/M size. I've only done a few measurements of this, and I've not completed my analysis, but my data does not show the sort of thrashing one might expect from a fixed stripe with a small interleave.
 -- richard
Richard Elling wrote:
> Erik Trimble wrote:
>> If you had known about the drive sizes beforehand, then you could have done something like this. Partition the drives as follows:
>>
>> A: 1 20GB partition
>> B: 1 20GB & 1 10GB partition
>> C: 1 40GB partition
>> D: 1 40GB partition & 2 10GB partitions
>>
>> then you do:
>>
>> zpool create tank mirror Ap0 Bp0 mirror Cp0 Dp0 mirror Bp1 Dp1
>>
>> and you get a total of 70GB of space. However, the performance of this is going to be bad (as you frequently need to write to both partitions on B & D, causing head seek), though you can still lose up to 2 drives before experiencing data loss.
>
> It is not clear to me that we can say performance will be bad for stripes on single disks. The reason is that ZFS dynamic striping does not use a fixed interleave. In other words, if I write a block of N bytes to an M-way dynamic stripe, it is not guaranteed that each device will get an I/O of N/M size. I've only done a few measurements of this, and I've not completed my analysis, but my data does not show the sort of thrashing one might expect from a fixed stripe with a small interleave.
> -- richard

That is correct, Richard. However, it applies to relatively small reads/writes, which do not exceed the maximum stripe size. That is probably the common case, but there is another issue here: even given that not every disk will see an I/O on a stripe access, there is still a relatively good chance that both partitions on the same disk get an I/O request. On average, I'd assume you don't really improve much over a full-stripe I/O, and in either case it would be worse than a zpool which did not have multiple partitions on the same disk. Also, for large-file access - where full-stripe access is guaranteed - you are certainly going to thrash the disk.

Numbers would be nice, of course. :-)

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA