I'm setting up a ZFS fileserver using a bunch of spare drives. I'd like some redundancy and to maximize disk usage, so my plan was to use raid-z. The problem is that the drives are considerably mismatched, and I haven't found documentation (though I don't see why it shouldn't be possible) on striping smaller drives together to match bigger ones. The drives are: 1x750, 2x500, 2x400, 2x320, 2x250. Is it possible to accomplish the following with those drives:

raid-z
  750
  500+250=750
  500+250=750
  400+320=720
  400+320=720

and if so, what is the command? The only way I've thought of to implement this is to create striped pools for each subset of drives, create a file that fills the whole pool, and use the file as a vdev. That seems like a nasty workaround though, and an unnecessary one. If there's any trouble understanding the question I can elaborate a bit. Any advice is greatly appreciated!

This message posted from opensolaris.org
On Thu, 21 Aug 2008, John wrote:

> I'm setting up a ZFS fileserver using a bunch of spare drives. I'd
> like some redundancy and to maximize disk usage, so my plan was to
> use raid-z. The problem is that the drives are considerably
> mismatched and I haven't found documentation (though I don't see why
> it shouldn't be possible) to stripe smaller drives together to match
> bigger ones. The drives are: 1x750, 2x500, 2x400, 2x320, 2x250. Is
> it possible to accomplish the following with those drives:

The ZFS vdev will only use up to the size of the smallest device in it. If your smallest device is 250GB and another device in the same vdev is 350GB, then 100GB of that device will be ignored.

While I would not really recommend it, a way out of the predicament is to use partitioning (via 'format') to partition a large drive into several smaller partitions which are similar in size to your smaller drives. The reason why this is not recommended is that a single drive failure could then take out several logical devices, and the vdev and pool could be toast. With care, it could work ok with simple mirrors, but mirrors waste 1/2 the physical disk space.

The better solution is to try to build your vdevs out of similar-sized disk drives from the start.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
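To make the warning concrete, here is a minimal, untested sketch of the partitioning approach Bob describes; the device names (c1t0d0 and so on) are made up for illustration:

  # With format(1M), split the 750GB drive (say c1t0d0) into a ~500GB slice s0
  # and a ~250GB slice s1, then pair each slice with the matching whole disks:
  zpool create tank \
      raidz c2t0d0 c2t1d0 c1t0d0s0 \
      raidz c3t0d0 c3t1d0 c1t0d0s1
  # First raidz: the two 500GB disks plus the 500GB slice.
  # Second raidz: the two 250GB disks plus the 250GB slice.
  # Caveat: the 750 now sits in both vdevs, so one dead drive degrades
  # both of them at once -- which is exactly why Bob advises against it.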
Hi,

John wrote:
> I'm setting up a ZFS fileserver using a bunch of spare drives. I'd like some redundancy and to maximize disk usage, so my plan was to use raid-z. The problem is that the drives are considerably mismatched and I haven't found documentation (though I don't see why it shouldn't be possible) to stripe smaller drives together to match bigger ones. The drives are: 1x750, 2x500, 2x400, 2x320, 2x250. Is it possible to accomplish the following with those drives:
>
> raid-z
> 750
> 500+250=750
> 500+250=750
> 400+320=720
> 400+320=720

Though I've never used this in production, it seems possible to layer ZFS on good old SDS (aka SVM, disksuite).

At least I managed to create a trivial pool on what-10-mins-ago-was-my-swap-slice:

haggis:/var/tmp# metadb -f -a -c 3 /dev/dsk/c5t0d0s7
haggis:/var/tmp# metainit d10 1 1 /dev/dsk/c5t0d0s1
d10: Concat/Stripe is setup
haggis:/var/tmp# zpool create test /dev/md/dsk/d10
haggis:/var/tmp# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME               STATE     READ WRITE CKSUM
        test               ONLINE       0     0     0
          /dev/md/dsk/d10  ONLINE       0     0     0

So it looks like you could do the following:

* Put a small slice (10-20MB should suffice; by convention it's slice 7 on the first cylinders) on each of your disks and make them the metadb, if you are not using SDS already:

    metadb -f -a -c 3 <all your slices_7>

  Make slice 0 the remainder of each disk.

* For your 500/250G and 400/320G drives, create a concat (stripe not possible) for each pair. For clarity, I'd recommend including the 750G disk as well (syntax from memory, apologies if I'm wrong with details):

    metainit d11 1 1 <750G disk>s0
    metainit d12 2 1 <500G disk>s0 1 <250G disk>s0
    metainit d13 2 1 <500G disk>s0 1 <250G disk>s0
    metainit d14 2 1 <400G disk>s0 1 <320G disk>s0
    metainit d15 2 1 <400G disk>s0 1 <320G disk>s0

* Create a raidz pool on your metadevices:

    zpool create <name> raidz /dev/md/dsk/d11 /dev/md/dsk/d12 /dev/md/dsk/d13 /dev/md/dsk/d14 /dev/md/dsk/d15

Again: I have never tried this, so please don't blame me if this doesn't work.

Nils

This message posted from opensolaris.org
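If anyone tries Nils's recipe, here is a small, equally untested sketch of how one might sanity-check the result before trusting data to it (the pool name bigpool is just an example):

  metastat                    # each of d11-d15 should report roughly 720-750GB
  zpool create bigpool raidz /dev/md/dsk/d11 /dev/md/dsk/d12 \
      /dev/md/dsk/d13 /dev/md/dsk/d14 /dev/md/dsk/d15
  zpool list bigpool          # sanity-check the reported size before copying data
  zpool scrub bigpool         # exercise every member end to end once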
About the best I can see:

  zpool create dirtypool raidz 250a 250b 320a raidz 320b 400a 400b raidz 500a 500b 750a

And you have to do them in that order. Each raidz vdev will be created using the size of its smallest device. This gets you about 2140GB (500 + 640 + 1000) of space. Your desired method gets 2880GB (720 * 4) but is WAY harder to set up and maintain, especially if you get into the SDS configuration.

I, for one, welcome our convoluted configuration overlords. I'd also like to see what the zpool looks like if it works. This is, obviously, untested.

chris

On Fri, Aug 22, 2008 at 11:03 AM, Nils Goroll <nils.goroll at hamburg.de> wrote:

> Hi,
>
> John wrote:
> > I'm setting up a ZFS fileserver using a bunch of spare drives. [...]
>
> Though I've never used this in production, it seems possible to layer ZFS
> on good old SDS (aka SVM, disksuite).
> [...]
> Again: I have never tried this, so please don't blame me if this doesn't
> work.
>
> Nils

--
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes
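Roughly what I'd expect zpool status to show for dirtypool if it works - a guess modeled on the output format Nils pasted above, not an actual run, with 250a and friends standing in for real cXtYdZ device names:

  # zpool status dirtypool
    pool: dirtypool
   state: ONLINE
   scrub: none requested
  config:

          NAME         STATE     READ WRITE CKSUM
          dirtypool    ONLINE       0     0     0
            raidz1     ONLINE       0     0     0
              250a     ONLINE       0     0     0
              250b     ONLINE       0     0     0
              320a     ONLINE       0     0     0
            raidz1     ONLINE       0     0     0
              320b     ONLINE       0     0     0
              400a     ONLINE       0     0     0
              400b     ONLINE       0     0     0
            raidz1     ONLINE       0     0     0
              500a     ONLINE       0     0     0
              500b     ONLINE       0     0     0
              750a     ONLINE       0     0     0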
Chris Cosby wrote:
> About the best I can see:
>
> zpool create dirtypool raidz 250a 250b 320a raidz 320b 400a 400b raidz
> 500a 500b 750a
>
> And you have to do them in that order. Each raidz vdev will be created
> using the size of its smallest device. This gets you about 2140GB
> (500 + 640 + 1000) of space. Your desired method gets 2880GB (720 * 4)
> but is WAY harder to set up and maintain, especially if you get into
> the SDS configuration.
>
> I, for one, welcome our convoluted configuration overlords. I'd also
> like to see what the zpool looks like if it works. This is, obviously,
> untested.

I don't think I'd be that comfortable doing it, but I suppose you could just add each drive as a separate vdev and set copies=2. Even that would only get you about 1845GB (if my math is right, the disks add up to 3690GB).

-Kyle

> [...]
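A sketch of what Kyle is describing, hedged the same way: every drive becomes its own top-level vdev and copies=2 provides the only redundancy. The device and pool names are placeholders, and whether this actually survives the loss of a whole disk is debated further down the thread:

  zpool create junkpool c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      c1t5d0 c1t6d0 c1t7d0 c1t8d0
  zfs set copies=2 junkpool     # applies only to data written from now on
  zfs get copies junkpool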
Heikki Suonsivu on list forwarder
2008-Aug-23 22:04 UTC
[zfs-discuss] Possible to do a stripe vdev?
Kyle McDonald wrote:
> Chris Cosby wrote:
>> About the best I can see:
>>
>> zpool create dirtypool raidz 250a 250b 320a raidz 320b 400a 400b raidz
>> 500a 500b 750a
>>
>> And you have to do them in that order. Each raidz vdev will be created
>> using the size of its smallest device. This gets you about 2140GB
>> (500 + 640 + 1000) of space. Your desired method gets 2880GB (720 * 4)
>> but is WAY harder to set up and maintain, especially if you get into
>> the SDS configuration.
>>
>> I, for one, welcome our convoluted configuration overlords. I'd also
>> like to see what the zpool looks like if it works. This is, obviously,
>> untested.
>
> I don't think I'd be that comfortable doing it, but I suppose you could
> just add each drive as a separate vdev and set copies=2. Even that would
> only get you about 1845GB (if my math is right, the disks add up to 3690GB).
>
> -Kyle

There seems to be confusion about whether this works or not.

- Marketing speak says metadata is redundant and, with at least two
  disks, is distributed across at least two disks.

- With copies=2 on a filesystem, the same should apply to file data.

- Which should mean that the above configuration should be redundant and
  tolerate the loss of one disk.

- People having trouble on the list say that it does not work: if, for any
  reason after a disk failure, the system shuts down, crashes, etc., you
  cannot mount the pools - they are in an unavailable state - even though
  according to the marketing speak it should be possible to mount them, carry
  on, recover all files with copies=2+, and get a report of which of the
  remaining files are bad.

- So, the QUESTION is: is the marketing speak totally bogus, or is there
  missing code/a bug/etc. which prevents bringing a pool with a lost disk
  online (looping back to the first question)?

Heikki

> [...]
Heikki Suonsivu on list forwarder
2008-Aug-24 10:26 UTC
[zfs-discuss] Redundancy with a stripe vdev and copies=2
Nils Goroll wrote:
> Hi,
>
> Heikki Suonsivu on list forwarder wrote:
>> - So, the QUESTION is: is the marketing speak totally bogus, or is there
>> missing code/a bug/etc. which prevents bringing a pool with a lost disk
>> online (looping back to the first question)?
>
> Besides those practical aspects, for me the question is whether there is any
> guarantee that with copies>=2, all copies will be placed on different
> vdevs if possible.

That is what the manual page says:

  : The copies are stored on different disks, if possible.

(Though "if possible" is not defined. I have not peeked into the code yet.)

> Nils

Heikki
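For reference, the knob being discussed, with example pool and filesystem names; note that only data written after the change gets the extra copies:

  zfs set copies=2 tank/important
  zfs get copies tank/important
  # NAME            PROPERTY  VALUE   SOURCE
  # tank/important  copies    2       local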
Heikki Suonsivu on list forwarder wrote:
> Kyle McDonald wrote:
>> Chris Cosby wrote:
>>> About the best I can see:
>>>
>>> zpool create dirtypool raidz 250a 250b 320a raidz 320b 400a 400b raidz
>>> 500a 500b 750a
>>> [...]
>>
>> I don't think I'd be that comfortable doing it, but I suppose you could
>> just add each drive as a separate vdev and set copies=2. Even that would
>> only get you about 1845GB (if my math is right, the disks add up to 3690GB).
>>
>> -Kyle
>
> There seems to be confusion about whether this works or not.

Of course it works as designed...

> - Marketing speak says metadata is redundant and, with at least two
>   disks, is distributed across at least two disks.

I'm not sure what "marketing speak" you're referring to; there are very few marketing materials for ZFS. Do you have a pointer?

> - With copies=2 on a filesystem, the same should apply to file data.

Yes, by definition copies=2 makes the data doubly redundant and the metadata triply redundant.

> - Which should mean that the above configuration should be redundant and
>   tolerate the loss of one disk.

It depends on the failure mode. If the disk suffers a catastrophic death, then you are in a situation where the full set of top-level vdevs is no longer available. Depending on the exact configuration, loss of a top-level vdev will cause the pool to not be importable. For the more common failure modes, it should recover nicely. I believe that the most common use cases for copies=2 are truly important data or the single-vdev case.

> - People having trouble on the list say that it does not work: if, for any
>   reason after a disk failure, the system shuts down, crashes, etc., you
>   cannot mount the pools - they are in an unavailable state - even though
>   according to the marketing speak it should be possible to mount them, carry
>   on, recover all files with copies=2+, and get a report of which of the
>   remaining files are bad.

It depends on the failure mode. Most of the ZFS versions out there do have the ability to identify files which have been corrupted, via "zpool status -xv".

> - So, the QUESTION is: is the marketing speak totally bogus, or is there
>   missing code/a bug/etc. which prevents bringing a pool with a lost disk
>   online (looping back to the first question)?

Real life dictates that there is no one, single, true answer -- just a series of trade-offs. If you ask me, I say make your data redundant by at least one method. More redundancy is better.

more below...

> Heikki
>
>>>> John wrote:
>>>> > I'm setting up a ZFS fileserver using a bunch of spare drives. I'd
>>>> > like some redundancy and to maximize disk usage, so my plan was to
>>>> > use raid-z. [...] The drives are: 1x750, 2x500, 2x400, 2x320,
>>>> > 2x250. Is it possible to accomplish the following with those drives:
>>>> > [...]

Don't worry about how much space you have; worry about how much space you need, over time. Consider growing your needs into the space over time. For example, if you need 100 GBytes today, 400 GBytes in 6 months, and 1 TByte next year, then start with:

  zpool create mypool mirror 320 320
  [turn off the remaining disks, no need to burn the power or lifetime]

in 6 months:

  zpool add mypool mirror 400 400

next year:

  zpool add mypool mirror 500 500

US street price for disks runs about $100, but density increases over time, so you could also build in a migration every two years or so:

  zpool replace mypool 320 1500   [do this for each side]

-- richard
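Spelled out with placeholder device names (the cXtYdZ names are made up; this is the same plan Richard outlines, not a tested transcript):

  # today: one mirrored pair of 320s
  zpool create mypool mirror c1t2d0 c1t3d0

  # in 6 months: add the 400s as a second mirrored pair
  zpool add mypool mirror c1t4d0 c1t5d0

  # next year: add the 500s
  zpool add mypool mirror c1t6d0 c1t7d0

  # every couple of years, swap in bigger disks one side at a time,
  # letting the mirror resilver before touching the other side
  zpool replace mypool c1t2d0 c2t0d0
  zpool replace mypool c1t3d0 c2t1d0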
[clarification below...]

Richard Elling wrote:
> Heikki Suonsivu on list forwarder wrote:
>> There seems to be confusion about whether this works or not.
> [...]
>> - Which should mean that the above configuration should be redundant and
>>   tolerate the loss of one disk.
>
> It depends on the failure mode. If the disk suffers a catastrophic
> death, then you are in a situation where the full set of top-level
> vdevs is no longer available. Depending on the exact configuration,
> loss of a top-level vdev will cause the pool to not be importable.

Clarification: the assumption I made here was that the top-level vdev is not protected. If the top-level vdev is protected (mirror, raidz[12]), then the loss of a disk will still leave the pool importable. The key concept here is that the copies feature works above and in addition to any vdev redundancy.

-- richard

> For the more common failure modes, it should recover nicely.
> I believe that the most common use cases for copies=2 are truly
> important data or the single-vdev case.
> [...]
Heikki Suonsivu on list forwarder
2008-Aug-25 06:25 UTC
[zfs-discuss] Possible to do a stripe vdev?
Richard Elling wrote:
> Heikki Suonsivu on list forwarder wrote:
>> There seems to be confusion about whether this works or not.
>
> Of course it works as designed...
>
>> - Marketing speak says metadata is redundant and, with at least two
>>   disks, is distributed across at least two disks.
>
> I'm not sure what "marketing speak" you're referring to; there are
> very few marketing materials for ZFS. Do you have a pointer?

This particular claim is from the zfs manual page.

>> - With copies=2 on a filesystem, the same should apply to file data.
>
> Yes, by definition copies=2 makes the data doubly redundant and
> the metadata triply redundant.
>
>> - Which should mean that the above configuration should be redundant and
>>   tolerate the loss of one disk.
>
> It depends on the failure mode. If the disk suffers a catastrophic
> death, then you are in a situation where the full set of top-level
> vdevs is no longer available. Depending on the exact configuration,
> loss of a top-level vdev will cause the pool to not be importable.
> For the more common failure modes, it should recover nicely.
> I believe that the most common use cases for copies=2 are truly
> important data or the single-vdev case.

Out of the last ten failures or so, I remember increasing bad blocks in two cases, a drive dying within a few minutes of startup in one case, and total drive death in all the others - funky noises or none at all. So let's assume this is the most common case: the drive simply dies. To simplify the question, let's assume it is replaced with an empty one (as one would do with a real RAID). That would mean that every block read from that disk returns data which does not match its checksum. As all metadata and data have been written to multiple drives, this should be a recoverable situation, and the machine should come up with all files accessible, with some log warnings about the situation. Why would it not be?

>> - People having trouble on the list say that it does not work: if, for any
>>   reason after a disk failure, the system shuts down, crashes, etc., you
>>   cannot mount the pools - they are in an unavailable state - even though
>>   according to the marketing speak it should be possible to mount them, carry
>>   on, recover all files with copies=2+, and get a report of which of the
>>   remaining files are bad.
>
> It depends on the failure mode. Most of the ZFS versions out there
> do have the ability to identify files which have been corrupted, via
> "zpool status -xv".
> [...]
>
> -- richard
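For what it's worth, the mechanics Heikki describes map onto something like the following (pool and device names are examples, and whether the import succeeds in the first place is exactly the open question in this thread):

  # disk c1t4d0 has died and been swapped for a blank drive
  zpool status -xv tank        # report unhealthy pools and any damaged files
  zpool replace tank c1t4d0    # resilver onto the new disk in the same slot
  zpool status tank            # watch the resilver progress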