Greetings all-

I have a new X4200 that I'm getting ready to deploy. It has four 146 GB SAS drives. I'd like to set up the box for maximum redundancy on the data stored on these drives. Unfortunately, it looks like ZFS boot/root aren't really options at this time. The LSI Logic controller in this box only supports either a RAID0 array with all four disks, or a RAID 1 array with two disks--neither of which are very appealing to me.

Ideally I'd like to have at least 300 gigs of storage available to the users, or more if I can do it with something like a RAID 5 setup. My concern, however, is that the boot and root partitions have data redundancy.

How would you set up this box? It's primarily used as a development server, running a myriad of applications.

Thank you-
John
On Tue, 7 Nov 2006, John Tracy wrote:

> Greetings all-
> I have a new X4200 that I'm getting ready to deploy. It has four 146 GB SAS drives. I'd like to set up the box for maximum redundancy on the data stored on these drives. Unfortunately, it looks like ZFS boot/root aren't really options at this time. The LSI Logic controller in this box only supports either a RAID0 array with all four disks, or a RAID 1 array with two disks--neither of which are very appealing to me.
> Ideally I'd like to have at least 300 gigs of storage available to the users, or more if I can do it with something like a RAID 5 setup. My concern, however, is that the boot and root partitions have data redundancy.
> How would you set up this box?
> It's primarily used as a development server, running a myriad of applications.

Since you've posted this to zfs-discuss, I'm assuming that your goal is to find a way to take advantage of ZFS on this box - if at all possible. So I'm going to propose a radical setup that I'm sure many will have issues with and which falls outside conventional/normal best practice, which in this case would be to form 2 mirrors of 2 disks each using the built-in H/W RAID controller and be done. If this is too radical for you, it will at least provide food for thought.

First, your assertion that you want redundancy for root and boot is somewhat flawed. Let me explain: in the "old days", losing root on a box was one of the worst possible user experiences - but that is simply not true today. If you keep the root filesystem pristine (more later) and just save off the config files you modify (/etc/hostname.*, /etc/passwd, /etc/group, /etc/shadow, /etc/hosts, blah, blah) periodically, the root partition can be restored quickly and simply, and then your config files restored. Consider the root partition disposable and replaceable. If you set up the system initially to net boot [1], then your root partition can be restored very quickly from the same set of files you used to load it initially! Since it's being used as a development box, if the root disk dies, you push in a replacement, net boot it and restore your saved config files. Downtime will probably be around 30 minutes, assuming you keep a spare disk handy (in a locked, rack-mount drawer immediately adjacent to the X4200 machine).

Next, I mentioned keeping root pristine. I'll also assume you'll use the Blastwave software repository, which installs software in /opt/csw by default.

So first up, the disk layout config. On the boot disk:

- 16 GB / (root) partition
- 4 to 16 GB swap partition
- 16 GB live upgrade partition
- a small, lightly used /export/home partition
- the rest of this disk will be un-allocated at this time

With the other 3 disks, form a 3-way raidz pool, with the following broad plan for the initial ZFS filesystems you'll place in this pool:

- a filesystem for shared home directories that will be shared into zones
- an additional swap vdev
- a filesystem for your master zone (see below)
- a filesystem for each zone you'll define on this box
- a filesystem for (one or more) junk zone(s)

So now root is still pristine - no supplemental software has been loaded or added. First up, build a "master" zone. It's the master in the sense that it'll be used to clone the real working zones, in which you will do *all* the "real" work. So create a fat zone (create -b), run "netservices limited" within it, add default user accounts, set up DNS, etc.
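(A rough sketch of what the pool-and-master-zone setup above could look like on the command line; the pool name "tank", the dataset names and the device names are placeholders, not taken from Al's message:)

   # 3-way raidz pool on the three non-boot disks
   zpool create tank raidz c0t1d0 c0t2d0 c0t3d0

   # filesystems per the broad plan above
   zfs create tank/home            # shared home directories, shared into zones
   zfs create tank/zones           # parent for the per-zone filesystems
   zfs create tank/zones/master    # the master zone's filesystem

   # the "additional swap vdev": a ZFS volume used as extra swap
   zfs create -V 4g tank/swapvol
   swap -a /dev/zvol/dsk/tank/swapvol

   # a whole-root ("fat") master zone rooted on its own filesystem
   zonecfg -z master "create -b; set zonepath=/tank/zones/master; commit"
   chmod 700 /tank/zones/master
   zoneadm -z master install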
The more effort you put into building/configuring this master zone, the easier it'll be to add work zones to the box.

Now that you have your master zone, use zfs clone to create "fat" zones for use as work areas. Within these work zones you'll install all your Blastwave packages, compilers, tools, etc. You can arrange for the shared home directories to be automatically mounted when a user with a shared home logs into the zone (using the automounter). You'll probably have some users who only have logins in certain zones, etc.

Next, repeat the above and build/clone more work zones on a per-project, per-department or per-whatever-makes-sense basis. You'll apply ZFS quotas on zones where you have concerns about the users gobbling up too much disk space. You'll have one or more junk zones to allow experiments with the system config to be safely isolated. Use zfs send/recv to back up individual zones, or datasets from within zones, to another ZFS server.

If you elect to install Solaris 10, then (my recommendation) wait for Update 3; you'll be able to clone zones by copying them, but not create them from ZFS snapshots. If you install Solaris Express or the latest/greatest OpenSolaris, you'll be able to create zones very quickly from a ZFS snapshot of your master zone - saving you a good deal of time and disk space.

Downsides: There are many. First off, you know that zones on ZFS are not supported (yet). And that applying patches may break the box and render it entirely useless. And that, currently, ZFS does not handle all disk errors gracefully. But all of these downsides will disappear over time, and I believe that the tradeoff, in terms of usability etc., is worth the increased risk of running this radical system config.

In any case, if you want to try this config, it'll take you a couple of hours to build a mockup on the X4200, and you can experiment with it and decide if you can live with it. Email me off-list if you have any questions that you feel are off-topic for the zfs-discuss list.

[1] use JET

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
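(A minimal sketch of the clone-from-snapshot workflow Al describes, on a build where zones can be created from ZFS snapshots; only the ZFS side is shown - the matching zonecfg/zoneadm steps are omitted - and the zone, host and dataset names are placeholders:)

   # snapshot the fully built master zone's dataset
   zfs snapshot tank/zones/master@golden

   # clone it as the starting point for a new per-project work zone
   zfs clone tank/zones/master@golden tank/zones/projA

   # cap the disk space the project zone's users can gobble up
   zfs set quota=40g tank/zones/projA

   # back the zone's dataset up to another ZFS server with send/recv
   zfs snapshot tank/zones/projA@backup1
   zfs send tank/zones/projA@backup1 | ssh backuphost zfs recv tank/backups/projA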
Robert Milkowski
2006-Nov-07 23:22 UTC
[zfs-discuss] Best Practices recommendation on x4200
Hello John,

Tuesday, November 7, 2006, 7:45:46 PM, you wrote:

JT> Greetings all-
JT> I have a new X4200 that I'm getting ready to deploy. It has
JT> four 146 GB SAS drives. I'd like to set up the box for maximum
JT> redundancy on the data stored on these drives. Unfortunately, it
JT> looks like ZFS boot/root aren't really options at this time. The
JT> LSI Logic controller in this box only supports either a RAID0
JT> array with all four disks, or a RAID 1 array with two
JT> disks--neither of which are very appealing to me.
JT> Ideally I'd like to have at least 300 gigs of storage
JT> available to the users, or more if I can do it with something like
JT> a RAID 5 setup. My concern, however, is that the boot
JT> and root partitions have data redundancy.
JT> How would you set up this box?
JT> It's primarily used as a development server, running a myriad of applications.

Use SVM to mirror the system, something like:

   d0   mirror of c0t0d0s0 and c0t1d0s0   /     2GB
   d5   mirror of c0t0d0s1 and c0t1d0s1   /var  2GB
   d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)

On all 4 disks create an s4 slice with the rest of the disk; it should be equal on all disks. Then create a raidz pool out of those slices. You should get above 400GB of usable storage.

That way you've got mirrored root disks, mirrored swap on another two disks matching exactly the space used by / and /var, and the rest of the disks for your data on ZFS.

ps. and of course you've got to create small slices for the metadbs.

-- 
Best regards,
 Robert                       mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
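(A rough sketch of the commands behind that layout; the metadevice names follow Robert's example, while the metadb slice (s7), the sizes and the pool name "tank" are placeholders:)

   # state database replicas on a small dedicated slice of each disk
   metadb -a -f -c 2 c0t0d0s7 c0t1d0s7 c0t2d0s7 c0t3d0s7

   # mirrored root (d0): build a one-way mirror, attach the other half after reboot
   metainit -f d1 1 1 c0t0d0s0
   metainit d2 1 1 c0t1d0s0
   metainit d0 -m d1
   metaroot d0          # updates /etc/vfstab and /etc/system for the root mirror
   # ... reboot, then:
   metattach d0 d2
   # repeat the same pattern for /var (d5) and swap (d10)

   # raidz pool from the big s4 slice on all four disks
   zpool create tank raidz c0t0d0s4 c0t1d0s4 c0t2d0s4 c0t3d0s4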
Richard Elling - PAE
2006-Nov-07 23:54 UTC
[zfs-discuss] Best Practices recommendation on x4200
The best thing about best practices is that there are so many of them :-)

Robert Milkowski wrote:
> Hello John,
>
> Tuesday, November 7, 2006, 7:45:46 PM, you wrote:
>
> JT> Greetings all-
> JT> I have a new X4200 that I'm getting ready to deploy. It has
> JT> four 146 GB SAS drives. I'd like to set up the box for maximum
> JT> redundancy on the data stored on these drives. Unfortunately, it
> JT> looks like ZFS boot/root aren't really options at this time. The
> JT> LSI Logic controller in this box only supports either a RAID0
> JT> array with all four disks, or a RAID 1 array with two
> JT> disks--neither of which are very appealing to me.
> JT> Ideally I'd like to have at least 300 gigs of storage
> JT> available to the users, or more if I can do it with something like
> JT> a RAID 5 setup. My concern, however, is that the boot
> JT> and root partitions have data redundancy.
> JT> How would you set up this box?
> JT> It's primarily used as a development server, running a myriad of applications.
>
> Use SVM to mirror the system, something like:
>
>    d0   mirror of c0t0d0s0 and c0t1d0s0   /     2GB
>    d5   mirror of c0t0d0s1 and c0t1d0s1   /var  2GB

IMNSHO, having a separate /var is a complete waste of effort. Also, 2 GBytes is too small.

>    d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)

Also a waste, use a swap file. Add a dumpdev if you care about kernel dumps; no need to mirror a dumpdev.

> On all 4 disks create an s4 slice with the rest of the disk; it should
> be equal on all disks. Then create a raidz pool out of those slices.
> You should get above 400GB of usable storage.
>
> That way you've got mirrored root disks, mirrored swap on another
> two disks matching exactly the space used by / and /var, and the
> rest of the disks for your data on ZFS.
>
> ps. and of course you've got to create small slices for the metadbs.

Simple /. Make it big enough to be useful. Keep its changes to a minimum. Make more than one, so that you can use LiveUpgrade. For consistency, you could make each disk look the same:

   s0  /       10G
   s6  zpool   free
   s7  metadb  100M

Use two disks for your BE, the other two for your ABE (assuming all are bootable). The astute observer will note that you could also use the onboard RAID controller for the same, simple configuration, less the metadbs of course.
 -- richard
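(A minimal sketch of how that simple layout might be used, assuming the live BE sits on c0t0d0s0/c0t1d0s0 and the ABE goes onto c0t2d0s0; the BE names and the choice of raidz for the s6 pool are placeholders, not Richard's:)

   # alternate boot environment on one of the other bootable disks
   lucreate -c current -n altroot -m /:/dev/dsk/c0t2d0s0:ufs

   # data pool built from the free s6 slice on each disk
   zpool create tank raidz c0t0d0s6 c0t1d0s6 c0t2d0s6 c0t3d0s6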
On 11/7/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:
> >    d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)
>
> Also a waste, use a swap file. Add a dumpdev if you care about
> kernel dumps, no need to mirror a dumpdev.

How do you figure that allocating space to a swap file is less of a waste than adding space to a swap device?

> Simple /. Make it big enough to be useful. Keep its changes to a
> minimum. Make more than one, so that you can use LiveUpgrade.
> For consistency, you could make each disk look the same:
>    s0  /       10G
>    s6  zpool   free
>    s7  metadb  100M

Since ZFS can get performance boosts from enabling the disk write cache if it has the whole disk, you may want to consider something more like the following for two of the disks (assumes mirroring rather than raidz in the zpool):

   s0  /       10G
   s1  swap    <pick your size>
   s3  alt /   10G
   s6  zpool   free
   s7  metadb  100M

The other pair of disks are given entirely to the zpool.

> Use two disks for your BE, the other two for your ABE (assuming all are
> bootable).

In any case, be sure that your root slices do not start at cylinder 0 (hmmm... maybe this is SPARC-specific advice...). One way to populate an ABE is to mirror slices. However, you cannot mirror between a device that starts at cylinder 0 and one that does not. Consider the following mock-up (output may be a bit skewed):

Starting state...

   # lustatus
   slice0 - active, mounted at d0
   slice3 - may or may not exist; if it exists it is on d30
   # metastat -p
   d0 -m d1 d2 1
   d1 1 1 c0t0d0s0
   d2 1 1 c0t1d0s0
   d30 -m d31 d32 1
   d31 1 1 c0t0d0s3
   d32 1 1 c0t1d0s3

Get rid of the slice3 boot environment, make d31 available to recreate it:

   # ludelete slice3
   # metadetach d30 d31
   # metaclear -r d30

Mirror d0 to d31. Wait for it to complete:

   # metattach d0 d31
   # while metastat -p | grep % ; do sleep 30 ; done

Detach d31 from d0, recreate the d30 mirror:

   # metadetach d0 d31
   # metainit d30 -m d31 1
   # metainit d32 1 1 c0t1d0s3
   # metattach d30 d32

Create a boot environment named slice3:

   # lucreate -n slice3 -m /:d30:ufs,preserve

Now you can manipulate the slice3 boot environment as needed.

Why go through all of this? My reasons have typically been:

1) Normally lucreate uses cpio, which doesn't cope with sparse files well. /var/adm/lastlog is a sparse file that can be problematic if you have users with large UIDs.

2) Lots of file systems mounted and little interest in creating very complex command lines with many -x options.

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
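(One possible reading of Mike's split, sketched as a single pool: the two untouched disks form one mirror pair - letting ZFS enable their write caches - and the s6 slices of the two boot disks form a second; the pool name and devices are placeholders:)

   zpool create tank mirror c0t2d0 c0t3d0 mirror c0t0d0s6 c0t1d0s6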
Richard Elling - PAE
2006-Nov-08 23:21 UTC
[zfs-discuss] Best Practices recommendation on x4200
Mike Gerdts wrote:
> On 11/7/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:
>> >    d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)
>>
>> Also a waste, use a swap file. Add a dumpdev if you care about
>> kernel dumps, no need to mirror a dumpdev.
>
> How do you figure that allocating space to a swap file is less of a
> waste than adding space to a swap device?

If you ever guess wrong (which you will), you can just make another swap file or redo the existing swap file. If you carve out a slice, then reclaiming the space is much more difficult. Creating more slices tends to also be difficult, so when you guess wrong you may still end up swapping to files.

>> Simple /. Make it big enough to be useful. Keep its changes to a
>> minimum. Make more than one, so that you can use LiveUpgrade.
>> For consistency, you could make each disk look the same:
>>    s0  /       10G
>>    s6  zpool   free
>>    s7  metadb  100M
>
> Since ZFS can get performance boosts from enabling the disk write
> cache if it has the whole disk, you may want to consider something
> more like the following for two of the disks (assumes mirroring rather
> than raidz in the zpool):
>
>    s0  /       10G
>    s1  swap    <pick your size>
>    s3  alt /   10G
>    s6  zpool   free
>    s7  metadb  100M
>
> The other pair of disks are given entirely to the zpool.
>
>> Use two disks for your BE, the other two for your ABE (assuming all are
>> bootable).
>
> In any case, be sure that your root slices do not start at cylinder 0
> (hmmm... maybe this is SPARC-specific advice...).

I think this is folklore. Can you cite a reference? NB: traditionally, block 0 contains the VTOC and, for SPARC systems, blocks 1-15 contain the bootblocks; see installboot(1M). Cylinder 0 may contain thousands of blocks for modern disks. It is a waste not to use them. AFAIK, all Sun software which deals with raw devices is aware of this.

> One way to populate
> an ABE is to mirror slices. However, you cannot mirror between a
> device that starts at cylinder 0 and one that does not.

Where is this restriction documented? It doesn't make sense to me. Maybe you have a scar from running Sybase in a previous life? ;-)
 -- richard
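(A minimal sketch of the swap-file flexibility Richard is describing; the path and size are placeholders:)

   # add a 4 GB swap file
   mkfile 4g /export/swapfile1
   swap -a /export/swapfile1

   # guessed wrong? drop it and reclaim the space
   swap -d /export/swapfile1
   rm /export/swapfile1

   # list the swap devices and files currently in use
   swap -l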
Nathan Kroenert
2006-Nov-09 00:00 UTC
[zfs-discuss] Best Practices recommendation on x4200
On Thu, 2006-11-09 at 10:21, Richard Elling - PAE wrote:
> > One way to populate
> > an ABE is to mirror slices. However, you cannot mirror between a
> > device that starts at cylinder 0 and one that does not.
>
> Where is this restriction documented? It doesn't make sense to me.
> Maybe you have a scar from running Sybase in a previous life? ;-)

IIRC, that's a part of the history of DiskSuite / SVM. More precisely, it was that you cannot mirror a slice that has a VTOC label on it to one that does not (hence the understanding of it being a cylinder 0 issue).

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/lvm/mirror/mirror_ioctl.c#887

Or, perhaps I need more coffee...

Cheers!

Nathan. ;)
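(For anyone wanting to check their own layout: prtvtoc shows where each slice begins, so you can see at a glance whether a root slice starts at sector 0 of cylinder 0; the device name is a placeholder:)

   # the "First Sector" column shows each slice's starting sector
   prtvtoc /dev/rdsk/c0t0d0s2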