IHAC that has 560+ LUNs that will be assigned to ZFS Pools and some level of protection. The LUNs are provided by seven Sun StorageTek FLX380s. Each FLX380 is configured with 20 Virtual Disks, and each Virtual Disk presents four Volumes/LUNs (4 Volumes x 20 Virtual Disks x 7 Disk Arrays = 560 LUNs in total).

We want to protect against all possible scenarios, including the loss of a Virtual Disk (which would take out four Volumes) and the loss of a FLX380 (which would take out 80 Volumes).

Today the customer has taken some number of LUNs from each of the arrays and put them into one ZFS Pool. They then create R5(15+1) RAIDz virtual disks (??), manually selecting LUNs to try and get the required level of redundancy. The issues are:

1) This is a management nightmare doing it this way

2) It is way too easy to make a mistake and have a RAIDz group that is not configured properly

3) It would be extremely difficult to scale this type of architecture if we later added a single FLX380 (6540) to the mix

I do not yet understand ZFS, but I have some ideas on how I think it works, and it seems to me that surely ZFS can handle this in a more elegant manner than what the customer is doing today. So while I am coming up to speed on how to architect solutions using ZFS, can any one of you help me think this through to make sure the customer meets all of their objectives? This is somewhat of a time-sensitive situation.

Ted Oatway
Principal Solutions Architect
Office of the CTO, Public Sector Storage Sales Group
Sun Microsystems
ted.oatway at sun.com
206.276.0769 Office
206.276.0769 Mobile
Richard Elling - PAE
2006-Sep-06 00:50 UTC
[zfs-discuss] Need input on implementing a ZFS layout
Oatway, Ted wrote:
> IHAC that has 560+ LUNs that will be assigned to ZFS Pools and some
> level of protection. The LUNs are provided by seven Sun StorageTek
> FLX380s. Each FLX380 is configured with 20 Virtual Disks. Each Virtual
> Disk presents four Volumes/LUNs. (4 Volumes x 20 Virtual Disks x 7 Disk
> Arrays = 560 LUNs in total)
>
> We want to protect against all possible scenarios including the loss of
> a Virtual Disk (which would take out four Volumes) and the loss of a
> FLX380 (which would take out 80 Volumes).

This means that your maximum number of columns is N, where N is the number of whole devices, any one of which you could stand to lose before data availability is compromised. In this case, that number is 7 (FLX380s).

> Today the customer has taken some number of LUNs from each of the arrays
> and put them into one ZFS Pool. They then create R5(15+1) RAIDz virtual
> disks (??) manually selecting LUNs to try and get the required level of
> redundancy.

Because your limit is 7, a single-parity solution like RAID-Z would dictate that the maximum size should be RAID-Z (6+1). Incidentally, you will be happier with 6+1 than 15+1 for most cases. For 2-way mirrors, you would want to go with rotating pairs of 1/2 of a FLX380 array. For RAID-Z2, dual parity, you would implement RAID-Z2 (5+2). In general, RAID-Z2 would give you the best data availability and data loss protection along with relatively good available space. Caveat: I can't say when RAID-Z2 will be available for non-Express Solaris versions; I have zero involvement with Solaris release schedules.

More constraints below...

> The issues are:
>
> 1) This is a management nightmare doing it this way

automate

> 2) It is way too easy to make a mistake and have a RAIDz group
>    that is not configured properly

automate

NB: this isn't as difficult to change later with ZFS as with some other LVMs. As long as the top-level requirements follow a consistent design, changing the lower-level implementation can be done later online. Worry about the top-level vdevs, which will be dictated by the number of FLX380s as shown above.

> 3) It would be extremely difficult to scale this type of
>    architecture if we later added a single FLX380 (6540) to the mix

The only (easy) way to scale while adding a single item, and still retain the same availability characteristics, is to use a mirror.

To go further down this line of thought would require the customer to articulate how they would rank the following requirements:
 + space
 + availability
 + performance
because you will need to trade these off.
 -- richard
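Richard's "automate" suggestion might look something like the following sketch. The controller names (c10-c16 standing in for the seven FLX380s), the LUN numbering (d0-d79), and the pool name "bigpool" are all invented here; the real c#t#d#-to-array mapping would have to be confirmed first.

    #!/bin/sh
    # Assemble a zpool create command with 80 single-parity RAID-Z (6+1) vdevs,
    # each vdev built from exactly one LUN on each of the seven arrays.
    vdevs=""
    d=0
    while [ $d -lt 80 ]; do
            cols=""
            for ctlr in c10 c11 c12 c13 c14 c15 c16; do
                    cols="$cols ${ctlr}t0d${d}"
            done
            vdevs="$vdevs raidz $cols"
            d=`expr $d + 1`
    done
    # Print the command for review; run it by hand once it looks right.
    echo zpool create bigpool $vdevs

Generating the command this way also guards against issue 2, since the one-LUN-per-array rule is enforced by the loop rather than by hand-picking 560 device names.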
Thanks for the response Richard. Forgive my ignorance, but the following questions come to mind as I read your response.

I would then have to create 80 RAIDz(6+1) Volumes, and the process of creating these Volumes can be scripted. But -

1) I would then have to create 80 mount points to mount each of these Volumes (?)

2) I would have no load balancing across mount points and I would have to specifically direct the files to a mount point using an algorithm of some design

3) A file landing on any one mount point would be constrained to the I/O of the underlying disk, which would represent 1/80th of the potential available

4) Expansion of the architecture, by adding in another single disk array, would be difficult and would probably be some form of data migration (?). For 800TB of data that would be unacceptable.

Ted Oatway
Sun Microsystems
206.276.0769 Mobile
> I would then have to create 80 RAIDz(6+1) Volumes and the process of
> creating these Volumes can be scripted. But -
>
> 1) I would then have to create 80 mount points to mount each of these
>    Volumes (?)

No. Each of the RAIDZs that you create can be combined into a single pool. Data written to the pool will stripe across all the RAIDz devices.

> 2) I would have no load balancing across mount points and I would have
>    to specifically direct the files to a mount point using an algorithm of
>    some design
>
> 3) A file landing on any one mount point would be constrained to the I/O
>    of the underlying disk which would represent 1/80th of the potential
>    available

See #1.

> 4) Expansion of the architecture, by adding in another single disk
>    array, would be difficult and would probably be some form of data
>    migration (?). For 800TB of data that would be unacceptable.

Today, you wouldn't be able to do it easily. In the future, you may be able to expand the RAIDz device. (Or, if you could remove a vdev from a pool, you could rotate through and remove each of the RAIDz devices, followed by an addition of a new (8-column) RAIDz.)

--
Darren Dunham                        ddunham at taos.com
Senior Technical Consultant          TAOS            http://www.taos.com/
Got some Dr Pepper?                  San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
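To make that concrete, a quick sanity check after building the pool (keeping the hypothetical pool name "bigpool" from the sketch earlier in the thread) would show one pool and one default mountpoint rather than 80:

    # zpool status bigpool      # every raidz vdev and its member LUNs
    # zpool list bigpool        # one pool, with the aggregate capacity of all vdevs
    # zfs list                  # a single default file system, mounted at /bigpool

Writes into /bigpool then stripe across all of the raidz vdevs, which also covers questions 2 and 3.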
Richard Elling - PAE
2006-Sep-06 14:52 UTC
[zfs-discuss] Need input on implementing a ZFS layout
Oatway, Ted wrote:
> Thanks for the response Richard. Forgive my ignorance but the following
> questions come to mind as I read your response.
>
> I would then have to create 80 RAIDz(6+1) Volumes and the process of
> creating these Volumes can be scripted. But -
>
> 1) I would then have to create 80 mount points to mount each of these
>    Volumes (?)

No. In ZFS, you create a zpool which has the devices and RAID configurations. The file systems (plural) are then put in the zpool. You could have one file system, or thousands. Each file system, by default, will be in the hierarchy under the zpool name, or you can change it as you need.

> 2) I would have no load balancing across mount points and I would have
>    to specifically direct the files to a mount point using an algorithm of
>    some design

ZFS will dynamically stripe across the sets. In traditional RAID terms, this is like RAID-1+0, RAID-5+0, or RAID-6+0.

> 3) A file landing on any one mount point would be constrained to the I/O
>    of the underlying disk which would represent 1/80th of the potential
>    available

It would be spread across the 80 sets.

> 4) Expansion of the architecture, by adding in another single disk
>    array, would be difficult and would probably be some form of data
>    migration (?). For 800TB of data that would be unacceptable.

It depends on how you do this. There are techniques for balancing which might work, but they have availability trade-offs because you are decreasing your diversity. I'm encouraged by the fact that they are planning ahead :-).

Also, unlike a traditional disk array or LVM software, ZFS will only copy the data. For example, in SVM, if you replace a whole disk, the resync will copy the "data" for the whole disk. For ZFS, it knows what data is valid, and will only copy the valid data. Thus the resync time is based upon the size of the data, not the size of the disk. There are more nuances here, but that covers it to the first order.
 -- richard
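As a small illustration of the hierarchy Richard describes (the dataset names here are made up), file systems are created administratively out of the pool, mount themselves under the pool name by default, and can be re-pointed if needed:

    # zfs create bigpool/projects                               # mounts at /bigpool/projects
    # zfs create bigpool/projects/alpha                         # mounts at /bigpool/projects/alpha
    # zfs set mountpoint=/export/alpha bigpool/projects/alpha   # or override the default

No newfs, vfstab entries, or per-volume sizing is involved; all of the file systems draw from the pool's shared space.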
Richard Elling - PAE
2006-Sep-06 16:50 UTC
[zfs-discuss] Need input on implementing a ZFS layout
There is another option. I'll call it "grow into your storage."

Pre-ZFS, for most systems you would need to allocate the storage well in advance of its use. For the 7xFLX380 case using SVM and UFS, you would typically set up the FLX380 LUNs, merge them together using SVM, and newfs. Growing is somewhat difficult for systems of that size because UFS has some smallish limits (16 TBytes per file system, less for older Solaris releases). Planning this in advance is challenging, and the process for growing existing file systems or adding new file systems would need careful attention.

By contrast, with ZFS we can add vdevs on the fly to an existing zpool, and the file systems can immediately use the new space.

The reliability of devices is measured in operational hours. So, for a fixed reliability metric, one way to improve your real-life happiness is to reduce the operational hours.

Putting these together, it makes sense to only add disks as you need the space. Keep the disks turned off until needed, to lengthen their life. In other words, grow into your storage. This doesn't work for everyone, or every situation, but ZFS makes it an easy, viable option to consider.
 -- richard
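A rough sketch of what that looks like, reusing the hypothetical device names from earlier in the thread: when more space is needed, another set of LUNs (still one per array, to preserve the availability characteristics) is added as a new top-level vdev, and the capacity is usable at once:

    # zpool add bigpool raidz c10t0d40 c11t0d40 c12t0d40 c13t0d40 c14t0d40 c15t0d40 c16t0d40
    # zpool list bigpool        # reflects the new space immediately; no newfs or growfs step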
Torrey McMahon
2006-Sep-06 17:20 UTC
[zfs-discuss] Need input on implementing a ZFS layout
+5

I've been saving my +1s for a few weeks now. ;)

Richard Elling - PAE wrote:
> There is another option. I'll call it "grow into your storage."
> [...]
> Putting these together, it makes sense to only add disks as you need
> the space. Keep the disks turned off until needed, to lengthen their
> life. In other words, grow into your storage.
Robert Milkowski
2006-Sep-06 17:54 UTC
[zfs-discuss] Re: Need input on implementing a ZFS layout
However, performance will be much worse, as data will be striped only to those mirrors already available. But if performance isn't an issue, it could be interesting.
Christine Tran
2006-Sep-06 18:20 UTC
[zfs-discuss] Need input on implementing a ZFS layout
This is a most interesting thread. I'm a little befuddled, though. How will ZFS know to select the RAID-Z2 stripes from each FLX380? Because if it stripes the (5+2) from the LUNs within one FLX380, this will not help if one frame goes irreplaceably out of service.

Let's say the devices are named thus (and I'm making this up):

/devices/../../SUNW,qlc at 0/vol at 0,0/WWN:sliceno

qlc at x denotes the FLX380 frame, [0-6]
vol at m,n denotes the virtual disk,LUN, [0-19],[0-3]

How do I know that my stripes are rotated among qlc at 0, qlc at 1, ... qlc at 6? When I make pools I don't give the raw device name, and ZFS may not know it has selected its (5+2) stripes from one frame. This placement is for redundancy, but then will I be wasting the other 79 spindles in each frame? It's not just 7 giant disks.

If I needed to see this for myself, or show it to a customer, what test may I set up to observe RAID-Z2 in action, so that I/O is observed to be spread among the 7 frames? I'm not yet comfortable with giving ZFS entire control over my disks without verification.
> Let's say the devices are named thus (and I'm making this up):
>
> /devices/../../SUNW,qlc at 0/vol at 0,0/WWN:sliceno
>
> qlc at x denotes the FLX380 frame, [0-6]
> vol at m,n denotes the virtual disk,LUN, [0-19],[0-3]
>
> How do I know that my stripes are rotated among qlc at 0, qlc at 1,
> ... qlc at 6?

Today, you'd have to create each of the VDEVs to explicitly use one LUN from each array. There's no parameter for ZFS to pick them automatically.

--
Darren Dunham                        ddunham at taos.com
Senior Technical Consultant          TAOS            http://www.taos.com/
Got some Dr Pepper?                  San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Richard Elling - PAE
2006-Sep-06 20:17 UTC
[zfs-discuss] Need input on implementing a ZFS layout
Darren Dunham wrote:
>> Let's say the devices are named thus (and I'm making this up):
>>
>> /devices/../../SUNW,qlc at 0/vol at 0,0/WWN:sliceno
>>
>> qlc at x denotes the FLX380 frame, [0-6]
>> vol at m,n denotes the virtual disk,LUN, [0-19],[0-3]
>>
>> How do I know that my stripes are rotated among qlc at 0, qlc at 1,
>> ... qlc at 6?
>
> Today, you'd have to create each of the VDEVs to explicitly use one LUN
> from each array. There's no parameter for ZFS to pick them
> automatically.

yep, something like:

    # zpool create mybigzpool \
        raidz2 c10t0d0 c11t0d0 c12t0d0 c13t0d0 c14t0d0 c15t0d0 c16t0d0 \
        raidz2 c10t0d1 c11t0d1 c12t0d1 c13t0d1 c14t0d1 c15t0d1 c16t0d1 \
        ...
        raidz2 c10t0dN c11t0dN c12t0dN c13t0dN c14t0dN c15t0dN c16t0dN

Obviously the c#t#d# would need to match your hardware, but you should be able to see the pattern. Later, you could add:

    # zpool add mybigzpool \
        raidz2 c10t0dM c11t0dM c12t0dM c13t0dM c14t0dM c15t0dM c16t0dM

 -- richard
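For the verification question earlier in the thread, a couple of standard zpool subcommands make the layout and the I/O distribution visible (shown against the example pool name above; the frame-to-controller mapping is still the assumed one):

    # zpool status mybigzpool        # which c# (frame) each member LUN of every raidz2 vdev comes from
    # zpool iostat -v mybigzpool 5   # per-vdev and per-device I/O, sampled every 5 seconds under a test load

So the one-LUN-per-frame placement doesn't have to be taken on faith; the device list and the live throughput spread across the seven frames can both be checked directly.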