I am trying to determine the best way to move forward with about 35 x86 X4200s. Each box has 4x 73GB internal drives. All the boxes will be built using Solaris 10 11/06. Additionally, these boxes are part of a highly available production environment with an uptime expectation of six 9's (just a few seconds per month of unscheduled downtime allowed).

Ideally, I would like to use a single RaidZ2 pool of all 4 disks, but apparently that is not supported yet. I understand there is the ZFSmount software for making a ZFS root, but I don't think I want to use that for an environment of this grade, and I can't wait until Sun comes out with it integrated later this year... have to use 11/06.

For perspective, these systems are currently running pure UFS, with only 2 of the 4 disks being used in a software RAID 1:

/ = 5GB
/var = 5GB
/tmp = 4GB
/home = 2GB
/data = 50GB

I am looking for recommendations on how to maximize the use of ZFS and minimize the use of UFS without resorting to anything "experimental".

So, assuming that each 73GB disk yields 70GB of usable space: would it make sense to create a UFS root partition of 5GB that is a 4-way mirror across all 4 disks? I haven't used SVM to create these types of mirrors before, so if anyone has any experience here let me know. My expectation is that up to any 3 of the 4 disks could fail while leaving the root partition intact. Basically, every time root has data updated, that data would also be written to each of the other three disks.

So this would leave each disk with 68GB of free space. I would then create a 4GB UFS /tmp (swap) partition that would be 4-way mirrored across all four disks, just as I am suggesting above for the root partition. So again, up to any 3 disks could fail and the swap filesystem would still be intact.

This would leave each disk with 64GB of free space, totaling 256GB. I would then create a single ZFS pool of all the remaining free space on each of the 4 disks.

How should this be done?

Perhaps a form of mirroring? What would be the difference between:

zpool create tank mirror c1d0 c2d0 c3d0 c4d0
or
zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0

Would it be better to use RaidZ with a hot spare, or RAIDZ2?

I would like /data, /home, and /var to be able to grow as needed and be able to withstand at least 2 disk failures (doesn't have to be any 2). I am open to using a hot spare.

Suggestions?
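For readers unfamiliar with SVM, a 4-way mirror of the root slice is built roughly like this. This is only a sketch: the c1t0d0-c1t3d0 device names, slice 0 for root, and slice 7 for the state database replicas are assumptions, not the poster's actual layout.

  # State database replicas, spread across all four disks (assumed slice 7)
  metadb -a -f c1t0d0s7 c1t1d0s7 c1t2d0s7 c1t3d0s7

  # One-way concat metadevices, one per root slice (assumed slice 0)
  metainit -f d11 1 1 c1t0d0s0
  metainit d12 1 1 c1t1d0s0
  metainit d13 1 1 c1t2d0s0
  metainit d14 1 1 c1t3d0s0

  # Build the mirror on the submirror holding the live root,
  # then let metaroot update /etc/vfstab and /etc/system
  metainit d10 -m d11
  metaroot d10

  # After the reboot, attach the other three submirrors (each triggers a resync)
  metattach d10 d12
  metattach d10 d13
  metattach d10 d14

The same pattern (without the metaroot step) would apply to a mirrored swap slice, pointing the vfstab swap entry at the mirror metadevice instead.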
On 06 March, 2007 - Matt B sent me these 2,5K bytes:

> I am trying to determine the best way to move forward with about 35 x86 X4200s.
> Each box has 4x 73GB internal drives.
>
> This would leave each disk with 64GB of free space, totaling 256GB. I
> would then create a single ZFS pool of all the remaining free space on
> each of the 4 disks.
>
> How should this be done?
>
> Perhaps a form of mirroring? What would be the difference between:
> zpool create tank mirror c1d0 c2d0 c3d0 c4d0

64GB usable space, any 3 disks can die.

> or
> zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0

128GB usable space, 1-2 disks can die.

> Would it be better to use RaidZ with a hot spare, or RAIDZ2?

Raidz + hotspare: 128GB usable space, 1 disk can die .. <pause and hope that nothing bad happens, wait for resilver> .. 1 more disk can die..
raidz2: 128GB usable space, any 2 disks can die at the same time.

> I would like /data, /home, and /var to be able to grow as needed and
> be able to withstand at least 2 disk failures (doesn't have to be any
> 2). I am open to using a hot spare.

4-way mirror has the highest read performance, then 2+2, then probably raidz and finally raidz2. It all depends on your tradeoff of security vs space vs performance.

http://blogs.sun.com/roch/entry/when_to_and_not_to
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
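For reference, the raidz-with-spare and raidz2 layouts Tomas compares would be created along these lines. This is a sketch: the c1d0-c4d0 names are carried over from the original post, and the spare vdev keyword assumes hot-spare support is present in the installed release.

  # Single-parity raidz across three disks plus one hot spare
  zpool create tank raidz c1d0 c2d0 c3d0 spare c4d0

  # Double-parity raidz2 across all four disks
  zpool create tank raidz2 c1d0 c2d0 c3d0 c4d0

  # Either way, verify the resulting layout and redundancy state
  zpool status tank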
Good timing, I'd like some feedback for some work I'm doing below...

Matt B wrote:
> I am trying to determine the best way to move forward with about 35 x86 X4200s.
> Each box has 4x 73GB internal drives.

Cool. Nice box.

> All the boxes will be built using Solaris 10 11/06. Additionally, these boxes
> are part of a highly available production environment with an uptime expectation
> of six 9's (just a few seconds per month of unscheduled downtime allowed).

Just for my curiosity, how do you measure six 9's?

> Ideally, I would like to use a single RaidZ2 pool of all 4 disks, but apparently
> that is not supported yet. I understand there is the ZFSmount software for making
> a ZFS root, but I don't think I want to use that for an environment of this grade,
> and I can't wait until Sun comes out with it integrated later this year... have to
> use 11/06.

It is supported to use 4 disks in a pool, but it isn't yet supported to use ZFS for the root file system. So, you'll end up mixing UFS and ZFS on the same disk, as a likely option.

> For perspective, these systems are currently running pure UFS, with only 2
> of the 4 disks being used in a software RAID 1:
> / = 5GB
> /var = 5GB
> /tmp = 4GB
> /home = 2GB
> /data = 50GB

I'm not a fan of separate /var, it just complicates things. I'll also presume that by "/tmp" you really mean "swap".

> I am looking for recommendations on how to maximize the use of ZFS and minimize
> the use of UFS without resorting to anything "experimental".

Put / in UFS, swap as raw, and everything else in a zpool. I would mirror /+swap on two disks, with the other two disks used as a LiveUpgrade alternate boot environment. When you patch or upgrade, you will normally have better availability (shorter planned outages) with LiveUpgrade. Also, you'll be able to roll back to the previous boot environment, potentially saving more time.

> So, assuming that each 73GB disk yields 70GB of usable space:
> would it make sense to create a UFS root partition of 5GB that is a 4-way mirror
> across all 4 disks? I haven't used SVM to create these types of mirrors before, so
> if anyone has any experience here let me know. My expectation is that up to any 3
> of the 4 disks could fail while leaving the root partition intact. Basically,
> every time root has data updated, that data would also be written to each of the
> other three disks.

I don't see any practical gain for a 4-way mirror over a 3-way mirror. With such configs you are much more likely to see some other fault which will ruin your day (e.g. accidental rm).

> So this would leave each disk with 68GB of free space. I would then create a
> 4GB UFS /tmp (swap) partition that would be 4-way mirrored across all four disks,
> just as I am suggesting above for the root partition. So again, up to
> any 3 disks could fail and the swap filesystem would still be intact.
>
> This would leave each disk with 64GB of free space, totaling 256GB. I would then
> create a single ZFS pool of all the remaining free space on each of the 4 disks.
>
> How should this be done?
>
> Perhaps a form of mirroring? What would be the difference between:
> zpool create tank mirror c1d0 c2d0 c3d0 c4d0
> or
> zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0
>
> Would it be better to use RaidZ with a hot spare, or RAIDZ2?
>
> I would like /data, /home, and /var to be able to grow as needed and be able to
> withstand at least 2 disk failures (doesn't have to be any 2). I am open to using
> a hot spare.
>
> Suggestions?

Prioritize your requirements.
Then take a look at the attached spreadsheet. What the spreadsheet contains is a report from RAIDoptimizer for the type of disk you'll be likely to have, based upon the disk vendor's data sheet (Seagate Savvio). The algorithms are described in my blog, http://blogs.sun.com/relling and an enterprising person could key them into a spreadsheet.

There are 4 main portions of the data:
+ configuration info: raid type, set size, spares, available space
+ mean time to data loss (MTTDL) info: for two different MTTDL models
+ performance info: random read iops and media bandwidths
+ mean time between services (MTBS) info: how often do you expect to repair something

I'm particularly interested in feedback on MTBS. The various MTBS models consider the immediate effect of having a bunch of disks, and the deferred repair strategies of waiting until you have to replace a disk, based upon the RAID config and spares. In any case, a higher MTBS is better, though there is more risk for each MTBS model. Let me know if this is helpful.

As Tomas said, you could look at some of this data in graphical form on my blog, though those graphs assume 46 disks instead of 4. For 4 disks, you have far fewer possible combinations, so it fits reasonably in a spreadsheet.

Caveat: the numbers are computed by algorithms and the code has not yet been verified to properly implement the algorithms. Models are simplifications of real life; don't expect real life to follow a model. If you do follow models, then note that Elizabeth Hurley is off the market :-)
-- richard

[Attachment: for_matt.ods, OpenDocument spreadsheet, 10974 bytes: http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20070306/7379ed0b/attachment.ods]
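A rough sketch of the LiveUpgrade flow Richard describes above, with made-up boot environment names, device names, and paths (the alternate boot environment is assumed to live on a root-sized slice of the third disk):

  # Create an alternate boot environment on the spare slice
  lucreate -c be_current -n be_patched -m /:/dev/dsk/c1t2d0s0:ufs

  # Patch or upgrade the inactive environment from an install image
  # (the image path here is purely illustrative)
  luupgrade -u -n be_patched -s /net/installserver/export/s10_image

  # Activate the new environment and boot into it; lustatus shows which is active
  luactivate be_patched
  init 6

If the new environment misbehaves, activating and booting back into the previous one gives the rollback Richard mentions.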
Thanks for the responses. There is a lot there I am looking forward to digesting. Right off the bat, though, I wanted to bring up something I found just before reading this reply, as the answer to this question would automatically answer some other questions.

There is a ZFS best practices wiki at
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#General_Storage_Pool_Performance_Considerations
that makes a couple of points:

* Swap space - Because ZFS caches data in kernel addressable memory, the kernel sizes will likely be larger than with other file systems. Configure additional disk-based swap to account for this difference. You can use the size of physical memory as an upper bound to the extra amount of swap space that might be required. Do not use slices on the same disk for both swap space and ZFS file systems. Keep the swap areas separate from the ZFS file systems.

* Do not use slices for storage pools that are intended for production use.

So after reading this, it seems that with only 4 disks to work with, and the fact that UFS for root (initial install) is still required, my only option to conform to the best practice is to use two disks with UFS/RAID 1, leaving only the remaining two disks for 100% ZFS. Additionally, the swap partition would have to go on the UFS set of disks to keep it separate from the ZFS set of disks.

If I am misinterpreting the wiki, please let me know.

There are some tradeoffs here. I would prefer to use a 4-way mirrored slice for a UFS root and a 4-way mirrored slice for UFS swap, and then leave equal slices free for ZFS, but then it sounds like I would have to risk not following the best practice and have to mess with SVM. The nice thing is I could be looking at a 128GB yield with a decent level of fault tolerance.

With zpooling, could I take the 64GB slices, place them into two ZFS stripes (raid0), and then join those two stripes into a ZFS mirror? Seems like then I could get a 128GB yield without having to use RaidZ with a hot spare or RaidZ2, which according to the links I just skimmed performs/lasts below mirroring.

The other option I described above: I could just slap the first two disks into a HW RAID 1, as the X4200s support 2-disk RAID 1, and then slap the remaining 2 disks into ZFS, and this (I think) would not be violating the best practice?

Any thoughts on the best practice points I am raising? It disturbs me that it would make a statement like "don't use slices for production".
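If the hardware-RAID-1 route is taken for the first pair, the on-board controller in an X4200 is normally managed with raidctl. A minimal sketch, assuming the two target disks are c1t0d0 and c1t1d0 (and note that creating the volume destroys the contents of the secondary disk):

  # Mirror the two disks with the on-board controller
  raidctl -c c1t0d0 c1t1d0

  # With no arguments, raidctl reports the state of existing RAID volumes
  raidctl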
On March 7, 2007 8:50:53 AM -0800 Matt B <mattbreedlove at yahoo.com> wrote:
> Any thoughts on the best practice points I am raising? It disturbs me
> that it would make a statement like "don't use slices for production".

I think that's just a performance thing.

-frank
Frank Cusack wrote:
> On March 7, 2007 8:50:53 AM -0800 Matt B <mattbreedlove at yahoo.com> wrote:
>> Any thoughts on the best practice points I am raising? It disturbs me
>> that it would make a statement like "don't use slices for production".
>
> I think that's just a performance thing.

Yep, for those systems with lots of disks.
-- richard
Matt B wrote:
> Thanks for the responses. There is a lot there I am looking forward to digesting.
> Right off the bat, though, I wanted to bring up something I found just before
> reading this reply, as the answer to this question would automatically answer
> some other questions.
>
> There is a ZFS best practices wiki at
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#General_Storage_Pool_Performance_Considerations
> that makes a couple of points:
>
> * Swap space - Because ZFS caches data in kernel addressable memory, the kernel
> sizes will likely be larger than with other file systems. Configure additional
> disk-based swap to account for this difference. You can use the size of physical
> memory as an upper bound to the extra amount of swap space that might be required.
> Do not use slices on the same disk for both swap space and ZFS file systems.
> Keep the swap areas separate from the ZFS file systems.

This recommendation is only suitable for low memory systems with lots of disks. Clearly, it would be impractical for a system with a single disk.

> * Do not use slices for storage pools that are intended for production use.
>
> So after reading this, it seems that with only 4 disks to work with, and the fact
> that UFS for root (initial install) is still required, my only option to conform
> to the best practice is to use two disks with UFS/RAID 1, leaving only the remaining
> two disks for 100% ZFS. Additionally, the swap partition would have to go on the
> UFS set of disks to keep it separate from the ZFS set of disks.
>
> If I am misinterpreting the wiki, please let me know.

The best thing about best practices is that there are so many of them :-/
I'll see if I can clarify in the wiki.

> There are some tradeoffs here. I would prefer to use a 4-way mirrored slice for a
> UFS root and a 4-way mirrored slice for UFS swap, and then leave equal slices free
> for ZFS, but then it sounds like I would have to risk not following the best
> practice and have to mess with SVM. The nice thing is I could be looking at a 128GB
> yield with a decent level of fault tolerance.
>
> With zpooling, could I take the 64GB slices, place them into two ZFS stripes (raid0),
> and then join those two stripes into a ZFS mirror? Seems like then I could get a
> 128GB yield without having to use RaidZ with a hot spare or RaidZ2, which according
> to the links I just skimmed performs/lasts below mirroring.

Be careful, with ZFS you don't take a stripe and mirror it (RAID-0+1), you take a mirror and stripe it (RAID-1+0). For example, you would do:

zpool create mycoolpool mirror c_d_t_s_ c_d_t_s_ mirror c_d_t_s_ c_d_t_s_

> The other option I described above: I could just slap the first two disks into a
> HW RAID 1, as the X4200s support 2-disk RAID 1, and then slap the remaining 2 disks
> into ZFS, and this (I think) would not be violating the best practice?

Yes, this would work fine. It would simplify your boot and OS install/upgrade. I'd still recommend planning on using LiveUpgrade -- leave a spare slice for an alternate boot environment.

> Any thoughts on the best practice points I am raising? It disturbs me that it would
> make a statement like "don't use slices for production".

Sometimes it is not what you say, it is how you say it.
-- richard
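Filled in with hypothetical controller/target/slice names, Richard's RAID-1+0 example might look like this:

  # Two 2-way mirrors; ZFS stripes writes across the two mirror vdevs
  zpool create mycoolpool \
      mirror c1t0d0s6 c1t1d0s6 \
      mirror c1t2d0s6 c1t3d0s6

  # Confirm the vdev layout
  zpool status mycoolpool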
So it sounds like the consensus is that I should not worry about using slices with ZFS, and the swap best practice doesn't really apply to my situation of a 4-disk X4200.

So in summary (please confirm), this is what we are saying is a safe bet for use in a highly available production environment?

With 4x 73GB disks yielding 70GB each:

5GB for root, which is UFS and mirrored 4 ways using SVM.
8GB for swap, which is raw and mirrored across the first two disks (optional: skip LiveUpgrade and 4-way mirror this swap partition).
8GB for LiveUpgrade, which is mirrored across the third and fourth disks.

This leaves 57GB of free space on each of the 4 disks in slices.
One ZFS pool will be created containing the 4 slices.
The first two slices will be used in a ZFS mirror yielding 57GB.
The last two slices will be used in a ZFS mirror yielding 57GB.
Then a ZFS stripe (raid0) will be laid over the two mirrors, yielding 114GB of usable space while able to sustain any 2 drives failing without a loss of data.

Thanks

P.S.
Availability is determined by using a synthetic SLA monitor that operates on 2-minute cycles, evaluating against a VIP by an external third party. If there are no errors in the report for the month we hit 100%; I think even one error (due to the 2-minute window) puts us below six 9's... so we basically have a zero-tolerance standard to hit the SLA and not get penalized monetarily.
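Within such a pool, the grow-as-needed file systems are just ZFS datasets. A sketch with an assumed pool name of tank and purely illustrative sizes:

  # Datasets share the pool's free space and grow on demand
  zfs create tank/data
  zfs create tank/home
  zfs set mountpoint=/data tank/data
  zfs set mountpoint=/home tank/home

  # Optional guard rails: cap /home, guarantee space for /data
  zfs set quota=10g tank/home
  zfs set reservation=20g tank/data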
Wade.Stuart at fallon.com
2007-Mar-07 19:08 UTC
[zfs-discuss] Re: ZFS/UFS layout for 4 disk servers
zfs-discuss-bounces at opensolaris.org wrote on 03/07/2007 12:31:14 PM:

> So it sounds like the consensus is that I should not worry about
> using slices with ZFS, and the swap best practice doesn't really apply
> to my situation of a 4-disk X4200.
>
> So in summary (please confirm), this is what we are saying is a safe
> bet for use in a highly available production environment?
>
> With 4x 73GB disks yielding 70GB each:
>
> 5GB for root, which is UFS and mirrored 4 ways using SVM.
> 8GB for swap, which is raw and mirrored across the first two disks
> (optional: skip LiveUpgrade and 4-way mirror this swap partition).
> 8GB for LiveUpgrade, which is mirrored across the third and fourth disks.
>
> This leaves 57GB of free space on each of the 4 disks in slices.
> One ZFS pool will be created containing the 4 slices.
> The first two slices will be used in a ZFS mirror yielding 57GB.
> The last two slices will be used in a ZFS mirror yielding 57GB.
> Then a ZFS stripe (raid0) will be laid over the two mirrors, yielding
> 114GB of usable space while able to sustain any 2 drives failing
> without a loss of data.

No, you will be able to sustain up to one disk in each of the two disk pairs failing at any time with no data loss. Lose both disks in a mirror pair and you lose data (and the system panics) -- slightly different than "any two disks".

> Thanks
>
> P.S.
> Availability is determined by using a synthetic SLA monitor that
> operates on 2-minute cycles, evaluating against a VIP by an external
> third party. If there are no errors in the report for the month we
> hit 100%; I think even one error (due to the 2-minute window) puts
> us below six 9's... so we basically have a zero-tolerance standard to
> hit the SLA and not get penalized monetarily.
Matt B wrote:
> Any thoughts on the best practice points I am raising? It disturbs me
> that it would make a statement like "don't use slices for
> production".

ZFS turns on the write cache on the disk if you give it the entire disk to manage. It is good for performance. So, you should use whole disks whenever possible. Slices work too, but the write cache for the disk will not be turned on by ZFS.

Cheers
Manoj
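To check (or force) the write cache state on a given disk, the expert mode of format exposes a cache menu. A sketch of the interactive session (menu entries can vary with the disk and driver, so treat this as illustrative):

  # format -e, select the disk, then:
  format> cache
  cache> write_cache
  write_cache> display    (shows whether the write cache is currently enabled)
  write_cache> enable     (turns it on manually; risky if UFS slices share the
                           disk, since UFS does not issue cache flushes)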
Manoj Joseph writes:
> Matt B wrote:
> > Any thoughts on the best practice points I am raising? It disturbs me
> > that it would make a statement like "don't use slices for
> > production".
>
> ZFS turns on the write cache on the disk if you give it the entire disk to
> manage. It is good for performance. So, you should use whole disks whenever
> possible.

Just a small clarification to state that the extra performance that comes from having the "write cache on" applies mostly to disks that do not have other means of command concurrency (NCQ, CTQ). With NCQ/CTQ, the write cache setting should not matter much to ZFS performance.

-r

> Slices work too, but the write cache for the disk will not be turned on by ZFS.
>
> Cheers
> Manoj
Robert Milkowski
2007-Mar-08 11:38 UTC
[zfs-discuss] Re: ZFS/UFS layout for 4 disk servers
Hello Matt,

Wednesday, March 7, 2007, 7:31:14 PM, you wrote:

MB> So it sounds like the consensus is that I should not worry about using slices with ZFS,
MB> and the swap best practice doesn't really apply to my situation of a 4-disk X4200.

MB> So in summary (please confirm), this is what we are saying is a
MB> safe bet for use in a highly available production environment?

MB> With 4x 73GB disks yielding 70GB each:

MB> 5GB for root, which is UFS and mirrored 4 ways using SVM.
MB> 8GB for swap, which is raw and mirrored across the first two disks
MB> (optional: skip LiveUpgrade and 4-way mirror this swap partition).
MB> 8GB for LiveUpgrade, which is mirrored across the third and fourth disks.

MB> This leaves 57GB of free space on each of the 4 disks in slices.
MB> One ZFS pool will be created containing the 4 slices.
MB> The first two slices will be used in a ZFS mirror yielding 57GB.
MB> The last two slices will be used in a ZFS mirror yielding 57GB.
MB> Then a ZFS stripe (raid0) will be laid over the two mirrors,
MB> yielding 114GB of usable space while able to sustain any 2 drives failing without a loss of data.

Eventually, if you care about how much storage is available, then:

1. 8GB on two disks for / in a mirrored config (SVM)
2. 8GB on another two disks for swap in a mirrored config (SVM)
3. the rest of the disks for ZFS
   a. raidz2 over 4 slices: capacity of 2x slice, bad random read performance
   b. raid-10 over 4 slices: capacity of 2x slice, good read performance, less reliability than a.

You lose the ability to do LU, but you gain some storage.

--
Best regards,
Robert                    mailto:rmilkowski at task.gda.pl
                          http://milek.blogspot.com
Frank Cusack writes:
> On March 7, 2007 8:50:53 AM -0800 Matt B <mattbreedlove at yahoo.com> wrote:
> > Any thoughts on the best practice points I am raising? It disturbs me
> > that it would make a statement like "don't use slices for production".
>
> I think that's just a performance thing.

Right. I think what would be very suboptimal from a ZFS standpoint would be to configure 2 slices from _one_ disk into a given zpool. This would send the I/O scheduler on a tangent, but it would nevertheless still work.

> -frank