IHAC that has 560+ LUNs that will be assigned to ZFS Pools and some level of protection. The LUNs are provided by seven Sun StorageTek FLX380s. Each FLX380 is configured with 20 Virtual Disks, and each Virtual Disk presents four Volumes/LUNs (4 Volumes x 20 Virtual Disks x 7 Disk Arrays = 560 LUNs in total).

We want to protect against all possible scenarios, including the loss of a Virtual Disk (which would take out four Volumes) and the loss of a FLX380 (which would take out 80 Volumes).

Today the customer has taken some number of LUNs from each of the arrays and put them into one ZFS Pool. They then create R5(15+1) RAIDz virtual disks (??), manually selecting LUNs to try and get the required level of redundancy. The issues are:

1) This is a management nightmare doing it this way

2) It is way too easy to make a mistake and have a RAIDz group that is not configured properly

3) It would be extremely difficult to scale this type of architecture if we later added a single FLX380 (6540) to the mix

I do not yet understand ZFS, but I have some ideas on how I think it works, and it seems to me that surely ZFS can handle this in a more elegant manner than what the customer is doing today. So while I am coming up to speed on how to architect solutions using ZFS, can any one of you help me think this through to make sure the customer meets all of their objectives? This is somewhat of a time-sensitive situation.

Ted Oatway
Principal Solutions Architect
Office of the CTO, Public Sector Storage Sales Group
Sun Microsystems
ted.oatway at sun.com
206.276.0769 Office
206.276.0769 Mobile
Richard Elling - PAE
2006-Sep-06 00:50 UTC
[zfs-discuss] Need input on implementing a ZFS layout
Oatway, Ted wrote:
> IHAC that has 560+ LUNs that will be assigned to ZFS Pools and some
> level of protection. The LUNs are provided by seven Sun StorageTek
> FLX380s. Each FLX380 is configured with 20 Virtual Disks. Each Virtual
> Disk presents four Volumes/LUNs. (4 Volumes x 20 Virtual Disks x 7 Disk
> Arrays = 560 LUNs in total)
>
> We want to protect against all possible scenarios including the loss of
> a Virtual Disk (which would take out four Volumes) and the loss of a
> FLX380 (which would take out 80 Volumes).

This means that your maximum number of columns is N, where N is the number of whole devices, any one of which you could stand to lose before data availability is compromised. In this case, that number is 7 (FLX380s).

> Today the customer has taken some number of LUNs from each of the arrays
> and put them into one ZFS Pool. They then create R5(15+1) RAIDz virtual
> disks (??) manually selecting LUNs to try and get the required level of
> redundancy.

Because your limit is 7, a single-parity solution like RAID-Z would dictate that the maximum size should be RAID-Z (6+1). Incidentally, you will be happier with 6+1 than 15+1 for most cases. For 2-way mirrors, you would want to go with rotating pairs of 1/2 of a FLX380 array. For RAID-Z2, dual parity, you would implement RAID-Z2 (5+2). In general, RAID-Z2 would give you the best data availability and data loss protection along with relatively good available space. Caveat: I can't say when RAID-Z2 will be available for non-Express Solaris versions; I have zero involvement with Solaris release schedules.

More constraints below...

> The issues are:
>
> 1) This is a management nightmare doing it this way

automate

> 2) It is way too easy to make a mistake and have a RAIDz group
>    that is not configured properly

automate

NB: this isn't as difficult to change later with ZFS as with some other LVMs. As long as the top-level requirements follow a consistent design, changing the lower-level implementation can be done later online. Worry about the top-level vdevs, which will be dictated by the number of FLX380s as shown above.

> 3) It would be extremely difficult to scale this type of
>    architecture if we later added a single FLX380 (6540) to the mix

The only (easy) way to scale while adding a single item, and still retain the same availability characteristics, is to use a mirror.

To go further down this line of thought would require the customer to articulate how they would rank the following requirements:
 + space
 + availability
 + performance
because you will need to trade these off.
 -- richard
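Richard's "automate" suggestion might look something like the following sketch. The controller names (c10-c16 standing in for the seven FLX380s), the LUN numbering (d0-d79), and the pool name "bigpool" are all invented here; the real c#t#d#-to-array mapping would have to be confirmed first.

    #!/bin/sh
    # Assemble a zpool create command with 80 single-parity RAID-Z (6+1) vdevs,
    # each vdev built from exactly one LUN on each of the seven arrays.
    vdevs=""
    d=0
    while [ $d -lt 80 ]; do
            cols=""
            for ctlr in c10 c11 c12 c13 c14 c15 c16; do
                    cols="$cols ${ctlr}t0d${d}"
            done
            vdevs="$vdevs raidz $cols"
            d=`expr $d + 1`
    done
    # Print the command for review; run it by hand once it looks right.
    echo zpool create bigpool $vdevs

Generating the command this way also guards against issue 2, since the one-LUN-per-array rule is enforced by the loop rather than by hand-picking 560 device names.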
Thanks for the response Richard. Forgive my ignorance, but the following questions come to mind as I read your response.

I would then have to create 80 RAIDz(6+1) Volumes, and the process of creating these Volumes can be scripted. But -

1) I would then have to create 80 mount points to mount each of these Volumes (?)

2) I would have no load balancing across mount points and I would have to specifically direct the files to a mount point using an algorithm of some design

3) A file landing on any one mount point would be constrained to the I/O of the underlying disk, which would represent 1/80th of the potential available

4) Expansion of the architecture, by adding in another single disk array, would be difficult and would probably be some form of data migration (?). For 800TB of data that would be unacceptable.

Ted Oatway
Sun Microsystems
206.276.0769 Mobile
> I would then have to create 80 RAIDz(6+1) Volumes and the process of
> creating these Volumes can be scripted. But -
>
> 1) I would then have to create 80 mount points to mount each of these
>    Volumes (?)

No. Each of the RAIDZs that you create can be combined into a single pool. Data written to the pool will stripe across all the RAIDz devices.

> 2) I would have no load balancing across mount points and I would have
>    to specifically direct the files to a mount point using an algorithm of
>    some design
>
> 3) A file landing on any one mount point would be constrained to the I/O
>    of the underlying disk which would represent 1/80th of the potential
>    available

See #1.

> 4) Expansion of the architecture, by adding in another single disk
>    array, would be difficult and would probably be some form of data
>    migration (?). For 800TB of data that would be unacceptable.

Today, you wouldn't be able to do it easily. In the future, you may be able to expand the RAIDz device. (Or, if you could remove a vdev from a pool, you could rotate through and remove each of the RAIDz devices, followed by an addition of a new (8-column) RAIDz.)

--
Darren Dunham                        ddunham at taos.com
Senior Technical Consultant          TAOS            http://www.taos.com/
Got some Dr Pepper?                  San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
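To make that concrete, a quick sanity check after building the pool (keeping the hypothetical pool name "bigpool" from the sketch earlier in the thread) would show one pool and one default mountpoint rather than 80:

    # zpool status bigpool      # every raidz vdev and its member LUNs
    # zpool list bigpool        # one pool, with the aggregate capacity of all vdevs
    # zfs list                  # a single default file system, mounted at /bigpool

Writes into /bigpool then stripe across all of the raidz vdevs, which also covers questions 2 and 3.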
Richard Elling - PAE
2006-Sep-06 14:52 UTC
[zfs-discuss] Need input on implementing a ZFS layout
Oatway, Ted wrote:
> Thanks for the response Richard. Forgive my ignorance but the following
> questions come to mind as I read your response.
>
> I would then have to create 80 RAIDz(6+1) Volumes and the process of
> creating these Volumes can be scripted. But -
>
> 1) I would then have to create 80 mount points to mount each of these
>    Volumes (?)

No. In ZFS, you create a zpool which has the devices and RAID configurations. The file systems (plural) are then put in the zpool. You could have one file system, or thousands. Each file system, by default, will be in the hierarchy under the zpool name, or you can change it as you need.

> 2) I would have no load balancing across mount points and I would have
>    to specifically direct the files to a mount point using an algorithm of
>    some design

ZFS will dynamically stripe across the sets. In traditional RAID terms, this is like RAID-1+0, RAID-5+0, or RAID-6+0.

> 3) A file landing on any one mount point would be constrained to the I/O
>    of the underlying disk which would represent 1/80th of the potential
>    available

It would be spread across the 80 sets.

> 4) Expansion of the architecture, by adding in another single disk
>    array, would be difficult and would probably be some form of data
>    migration (?). For 800TB of data that would be unacceptable.

It depends on how you do this. There are techniques for balancing which might work, but they have availability trade-offs because you are decreasing your diversity. I'm encouraged by the fact that they are planning ahead :-).

Also, unlike a traditional disk array or LVM software, ZFS will only copy the data. For example, in SVM, if you replace a whole disk, the resync will copy the "data" for the whole disk. For ZFS, it knows what data is valid, and will only copy the valid data. Thus the resync time is based upon the size of the data, not the size of the disk. There are more nuances here, but that covers it to the first order.
 -- richard
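As a small illustration of the hierarchy Richard describes (the dataset names here are made up), file systems are created administratively out of the pool, mount themselves under the pool name by default, and can be re-pointed if needed:

    # zfs create bigpool/projects                               # mounts at /bigpool/projects
    # zfs create bigpool/projects/alpha                         # mounts at /bigpool/projects/alpha
    # zfs set mountpoint=/export/alpha bigpool/projects/alpha   # or override the default

No newfs, vfstab entries, or per-volume sizing is involved; all of the file systems draw from the pool's shared space.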
Richard Elling - PAE
2006-Sep-06 16:50 UTC
[zfs-discuss] Need input on implementing a ZFS layout
There is another option. I'll call it "grow into your storage."

Pre-ZFS, for most systems you would need to allocate the storage well in advance of its use. For the 7xFLX380 case using SVM and UFS, you would typically set up the FLX380 LUNs, merge them together using SVM, and newfs. Growing is somewhat difficult for systems of that size because UFS has some smallish limits (16 TBytes per file system, less for older Solaris releases). Planning this in advance is challenging, and the process for growing existing file systems or adding new file systems would need careful attention.

By contrast, with ZFS we can add vdevs on the fly to an existing zpool, and the file systems can immediately use the new space.

The reliability of devices is measured in operational hours. So, for a fixed reliability metric, one way to improve your real-life happiness is to reduce the operational hours.

Putting these together, it makes sense to only add disks as you need the space. Keep the disks turned off until needed, to lengthen their life. In other words, grow into your storage. This doesn't work for everyone, or every situation, but ZFS makes it an easy, viable option to consider.
 -- richard
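A rough sketch of what that looks like, reusing the hypothetical device names from earlier in the thread: when more space is needed, another set of LUNs (still one per array, to preserve the availability characteristics) is added as a new top-level vdev, and the capacity is usable at once:

    # zpool add bigpool raidz c10t0d40 c11t0d40 c12t0d40 c13t0d40 c14t0d40 c15t0d40 c16t0d40
    # zpool list bigpool        # reflects the new space immediately; no newfs or growfs step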
Torrey McMahon
2006-Sep-06 17:20 UTC
[zfs-discuss] Need input on implementing a ZFS layout
+5

I've been saving my +1s for a few weeks now. ;)

Richard Elling - PAE wrote:
> There is another option. I'll call it "grow into your storage."
> [...]
> Putting these together, it makes sense to only add disks as you need
> the space. Keep the disks turned off until needed, to lengthen their
> life. In other words, grow into your storage.
Robert Milkowski
2006-Sep-06 17:54 UTC
[zfs-discuss] Re: Need input on implementing a ZFS layout
However, performance will be much worse, as data will be striped only to those mirrors already available. But if performance isn't an issue, it could be interesting.
Christine Tran
2006-Sep-06 18:20 UTC
[zfs-discuss] Need input on implementing a ZFS layout
This is a most interesting thread. I'm a little befuddled, though. How will ZFS know to select the RAID-Z2 stripes from each FLX380? Because if it stripes the (5+2) from the LUNs within one FLX380, this will not help if one frame goes irreplaceably out of service.

Let's say the devices are named thus (and I'm making this up):

/devices/../../SUNW,qlc at 0/vol at 0,0/WWN:sliceno

qlc at x denotes the FLX380 frame, [0-6]
vol at m,n denotes the virtual disk,LUN, [0-19],[0-3]

How do I know that my stripes are rotated among qlc at 0, qlc at 1, ... qlc at 6? When I make pools I don't give the raw device name, and ZFS may not know it has selected its (5+2) stripes from one frame. This placement is for redundancy, but then will I be wasting the other 79 spindles in each frame? It's not just 7 giant disks.

If I needed to see this for myself, or show it to a customer, what test may I set up to observe RAID-Z2 in action, so that I/O is observed to be spread among the 7 frames? I'm not yet comfortable with giving ZFS entire control over my disks without verification.
> Let's say the devices are named thus (and I'm making this up):
>
> /devices/../../SUNW,qlc at 0/vol at 0,0/WWN:sliceno
>
> qlc at x denotes the FLX380 frame, [0-6]
> vol at m,n denotes the virtual disk,LUN, [0-19],[0-3]
>
> How do I know that my stripes are rotated among qlc at 0, qlc at 1,
> ... qlc at 6?

Today, you'd have to create each of the VDEVs to explicitly use one LUN from each array. There's no parameter for ZFS to pick them automatically.

--
Darren Dunham                        ddunham at taos.com
Senior Technical Consultant          TAOS            http://www.taos.com/
Got some Dr Pepper?                  San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Richard Elling - PAE
2006-Sep-06 20:17 UTC
[zfs-discuss] Need input on implementing a ZFS layout
Darren Dunham wrote:
>> Let's say the devices are named thus (and I'm making this up):
>>
>> /devices/../../SUNW,qlc at 0/vol at 0,0/WWN:sliceno
>>
>> qlc at x denotes the FLX380 frame, [0-6]
>> vol at m,n denotes the virtual disk,LUN, [0-19],[0-3]
>>
>> How do I know that my stripes are rotated among qlc at 0, qlc at 1,
>> ... qlc at 6?
>
> Today, you'd have to create each of the VDEVs to explicitly use one LUN
> from each array. There's no parameter for ZFS to pick them
> automatically.

yep, something like:

    # zpool create mybigzpool \
        raidz2 c10t0d0 c11t0d0 c12t0d0 c13t0d0 c14t0d0 c15t0d0 c16t0d0 \
        raidz2 c10t0d1 c11t0d1 c12t0d1 c13t0d1 c14t0d1 c15t0d1 c16t0d1 \
        ...
        raidz2 c10t0dN c11t0dN c12t0dN c13t0dN c14t0dN c15t0dN c16t0dN

Obviously the c#t#d# would need to match your hardware, but you should be able to see the pattern. Later, you could add:

    # zpool add mybigzpool \
        raidz2 c10t0dM c11t0dM c12t0dM c13t0dM c14t0dM c15t0dM c16t0dM

 -- richard
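For the verification question earlier in the thread, a couple of standard zpool subcommands make the layout and the I/O distribution visible (shown against the example pool name above; the frame-to-controller mapping is still the assumed one):

    # zpool status mybigzpool        # which c# (frame) each member LUN of every raidz2 vdev comes from
    # zpool iostat -v mybigzpool 5   # per-vdev and per-device I/O, sampled every 5 seconds under a test load

So the one-LUN-per-frame placement doesn't have to be taken on faith; the device list and the live throughput spread across the seven frames can both be checked directly.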