We are preparing for the day when ZFS will be preinstalled on systems at Sun's factory. This design will occur in two phases: before and after ZFS is a bootable file system. We're interested in feedback from the community on how to allocate the disk resources.

In systems currently shipping from the factory with Solaris preinstalled, the boot disk layout follows the Enterprise Installation Services (EIS) boot disk standard. There is enough, uhmm, flexibility in the slice allocations in that standard that it is unlikely to mesh with most customer standards, where they exist. With UFS, depending on the product, we allocate slice 1 for /, slice 2 for swap, etc. The size of the allocations again depends on the size of the boot disk and the system. In other words, there is significant variance between systems, because once the allocations are made they are difficult to change (UFS) and may require reinstallation of the OS.

Currently non-boot disks are not preallocated or preinstalled.

Other vendors often assign the boot disk as one big partition and put everything in that partition. Most Sun docs follow the traditional, old UNIX method of having different slices (partitions) for /, swap, et al.

Another way to look at this is that whatever UFS allocation we make for the factory, it is wrong for a significant number of customers. Hence the request for feedback.

1. Before ZFS is a bootable file system:

   We still have the issue where the boot disk requires pre-allocated UFS slices. The current thinking is that we won't change the boot disk from existing EIS standards. However, we'd like to add the other, non-boot disks in the system as a big zpool.

2. After ZFS is a bootable file system:

   Just put all disks in a big zpool.

We will use best practices for RAS in the zpool based upon the capabilities of the system, with the priority policy such that data should be protected first and free space optimized second.

Systems which can support hardware RAID of the boot disk may or may not be mirrored using hardware RAID. That decision will be made later, on a case-by-case basis, and doesn't materially affect the disk layout policies.

Our thinking is that once you have a big zpool, it is very easy to add, delete, or change file systems, zvols, or whatever. In other words, whatever choices we make for file system allocations in the factory cannot be wrong. If you don't like it, changing it does not require a reinstall of the OS. Life is much simpler. Simple is good.

Comments?
 -- richard
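For illustration, a rough sketch of the kind of post-install reshuffling a big zpool allows; the pool and dataset names here are placeholders:

    # assuming a pre-created pool named "tank"
    zfs create tank/export            # add a file system
    zfs create -V 2g tank/swapvol     # add a zvol
    zfs set quota=10g tank/export     # change your mind later
    zfs destroy tank/swapvol          # or remove it, no repartitioning needed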
The only suggestion I'd make here is to ensure that nothing uses any space in that big zpool of remaining disk space - i.e., do not use it for /export/home or similar. This should leave you free to use it if you want, but without having to reinstall or flail about if you want to do it differently.

Darren
One reason not to put all the disks into one big zpool is that there will likely be some restrictions on root pools that will not apply to other pools. At this time, it's looking like we will not support concatenation or RAIDZ in root pools, at least not initially. This is mainly due to limitations on how many disks the firmware can access at boot time. At first release of zfs boot, the only multi-disk configuration possible for a root pool will be a mirrored configuration.

I suggest dedicating one disk to the root pool and putting all of the rest of the disks into a second pool.

Lori
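A minimal sketch of the split suggested here, with placeholder device names; the root pool is mirrored on slices and everything else goes into a second pool:

    zpool create rootpool mirror c0t0d0s0 c0t1d0s0
    zpool create datapool raidz c0t2d0 c0t3d0 c0t4d0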
As a former customer, I'd recommend:

- Configure swap as s0, with 2 x memory up until 8gb. If you've 4gb+ memory and you need that much swap, you're going to want to add additional swap disks for better performance.
- Configure the rest of the initial system disk as a pool.
- Leave the rest alone, with the idea that the customer can allocate it as needed.
- As a customer, I'd then configure the 2nd disk (preferably on another controller) as a mirror. The rest would exist as another pool to allow for system upgrades (and consequent root disk pool rebuilds) without impacting the 'data disks'.

It's rather similar to the Veritas VxVM idea of 'rootdg' for the root disk(s) and another disk group for the remaining storage.

Just my $.02.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273   Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
> 1. Before ZFS is a bootable file system:
>
>    We still have the issue where the boot disk requires pre-allocated
>    UFS slices. The current thinking is that we won't change the boot
>    disk from existing EIS standards. However we'd like to add the
>    other, non-boot disks in the system as a big zpool.

** Sounds good, although I think you should (depending on how many disks) add it as a raid-z/mirror pool. A simple concatenated pool (even with ZFS) might give people the wrong impression, in that the pool won't be protected, and there is NOT an easy way of transforming a pool from concat to raidz or something (depending on how many internal disks you're dealing with).

> 2. After ZFS is a bootable file system:
>
>    Just put all disks in a big zpool.

** Hmm... Not ALL disks... Say on a 4-disk-capable system like a T2000: does it make sense to mirror the boot disks, say a "boot_zpool" (mirrored), and then have a "data_zpool" or something along those lines for the other 2 disks (also mirrored)? The thought would then be that ALL your T2000 systems would come with the same "boot_zpool" configured the same way, and if someone orders the additional 2 disks, those get dropped into a "data_zpool" or something.

Having said that, I like the appeal of a single zpool (raid-z) across all 4 disks... There really shouldn't be a reason to split off the data disks, since you can't export them anyway (they're internal), and I could see that as the only/main reason to split the boot pool from the data pool.

Ok, I'm convinced... Stick all internal disks in a RAIDZ pool, and carve up the boot volumes etc. as needed. (This has SO MUCH potential when it comes to live upgrade, whole-system snapshots before application upgrades, patching, etc.)

Either way, going the "pool route" should allow the initial disk/volume-size/layout to stay the same regardless of someone ordering 1, 2, 3, or 4 disks. (The good news here being that if someone purchases more than 2 disks, their data is automatically protected, and regardless of how many disks they purchase the standard layout is the same.)

Please beware of scenarios that would force someone to change from mirrored to raid-z etc., as (if memory serves) those conversions, and adding a disk into a raidz pool, aren't 100% there yet.

Thanks,

 -- MikeE

Michael J. Ellis (mike.ellis at fidelity.com)
FISC/UNIX Engineering
400 Puritan Way (M2G)
Marlborough, MA 01752
Phone: 508-787-8564
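Setting aside the root-pool restrictions Lori mentions above, a sketch of the all-internal-disks RAID-Z layout being discussed, with hypothetical device names:

    zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0
    zfs create tank/export
    zfs create tank/export/home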
Richard Elling
2006-Mar-24 06:05 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
> As a former customer, I'd recommend:
>
> - Configure swap as s0, with 2 x memory up until 8gb. If you've 4gb+
>   memory and you need that much swap, you're going to want to add
>   additional swap disks for better performance.

Disagree. If you have to swap, your performance stinks, period. Since swap is the default dumpdev, configure swap to handle the expected dump. Beyond that, it is a waste of space. Indeed, most systems today do not use swap space at all during normal operation.

> - Configure the rest of the initial system disk as a pool.
> - Leave the rest alone, with the idea that the customer can allocate
>   it as needed.

We've done that in the past. What we find is that the space is never allocated. Think about it. What would happen if my grandmother got a name-brand PC where half of the disk was not allocated? Would she ever figure it out, or would she call customer service and complain that she ordered a 400 GByte drive and only got a 200 GByte drive? Since answering calls costs money, why would a vendor do that?

> - As a customer, I'd then configure the 2nd disk (preferably on
>   another controller) as a mirror. The rest would exist as another
>   pool to allow for system upgrades (and consequent root disk pool
>   rebuilds) without impacting the 'data disks'.

In the old days, when we used bus-based disk interconnects, the worry was that a hung controller or bus would take out all drives. Today, with SAS and SATA as point-to-point disk interconnects, this failure mode is eliminated. The controllers are highly integrated and reliable, much more reliable than the disks (by one or more orders of magnitude). So expect that you will have one controller with several fault-isolated disks. Also, expect that they won't be disks at all (and have different failure modes).

> It's rather similar to the Veritas VxVM idea of 'rootdg' for the root
> disk(s) and another disk group for the remaining storage.

Yes. I think this is what Lori is alluding to: special restrictions on the boot pool may lead to a practice of using a separate pool for disks beyond the boot disks.

A few years ago I did a study and found that > 90% of Sun customers use < 3.3 GBytes for boot and basic OS services. So, another question is what to do with the other 495+ GBytes on the disk? A second pool has some merit.

> Just my $.02.

Thanks. I just need 98 more and I'll have a dollar :-)
 -- richard
> 2. After ZFS is a bootable file system:
>
>    Just put all disks in a big zpool.

The one caveat here is that you may want the ability to export a pool from system A and import it on system B. For this to work, the data must be on disks that system A doesn't need to boot -- and thus, A's root can't be there.

As a rule, I like keeping the machine's personality (i.e. the root filesystem) separate from the data. Eventually the root filesystem will become like the SIM card in your cell phone. We already have systems in the lab with root on compact flash. This gives you a clean separation between your data (on migratable disks), the identity of the system accessing the data (on the CF card), and the system hardware.

Jeff
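A sketch of the migration Jeff describes, with a placeholder pool name:

    # on system A
    zpool export datapool
    # physically move the disks, then on system B
    zpool import datapool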
> - Configure swap as s0, with 2 x memory up until 8gb.

If you're using ZFS, I'd recommend swapping on a zvol. There are several advantages to doing this:

(1) You don't need a separate swap slice.
(2) You can grow it whenever you want.
(3) You get checksums, compression, dynamic striping, RAID-Z, etc. for free.

On the flip side, I strongly recommend a dedicated dump slice. We'll eventually support doing dumps through a storage pool, but it's really better to go straight to a physical device. When you're generating a crash dump, the kernel is damaged; so the less code you need to generate the dump, the better.

Jeff
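A rough sketch of both suggestions, with hypothetical pool, volume, and device names (sizes are arbitrary):

    zfs create -V 4g tank/swapvol
    swap -a /dev/zvol/dsk/tank/swapvol      # swap on a zvol
    dumpadm -d /dev/dsk/c0t0d0s1            # dedicated dump slice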
Darren Dunham
2006-Mar-24 07:50 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
> 1. Before ZFS is a bootable file system:
>
>    We still have the issue where the boot disk requires pre-allocated
>    UFS slices. The current thinking is that we won't change the boot
>    disk from existing EIS standards. However we'd like to add the
>    other, non-boot disks in the system as a big zpool.

Will the ability to remove disks from a pool be available at this point? If not, I think I'm going to have to blow away the pool to grab one of the disks out as a root mirror. In virtually any multi-disk configuration, I'm going to want a root mirror.

--
Darren
Peter Tribble
2006-Mar-24 10:44 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
On Fri, 2006-03-24 at 06:05, Richard Elling wrote:
> > As a former customer, I'd recommend:
> >
> > - Configure swap as s0, with 2 x memory up until 8gb. If you've 4gb+
> >   memory and you need that much swap, you're going to want to add
> >   additional swap disks for better performance.
>
> Disagree. If you have to swap, your performance stinks, period.
> Since swap is the default dumpdev, configure swap to handle
> the expected dump. Beyond that, it is a waste of space. Indeed
> most systems today do not use swap space at all during normal
> operation.

While the sentiment's heading in the right direction, I would disagree with the "most" bit. At my previous employer, I would build V880s with 32G of RAM and 96G of disk swap. They never ran out of swap, but getting to 50% wasn't entirely uncommon. Many of the systems I now look after follow the rule you mention - size for dump - (and in one case I couldn't even get a crash dump) and we're seeing regular problems running out of swap and /tmp. Not all workloads are the same. For most new systems you only need a tiny fraction of the disk to make sure you never have a problem.

> A few years ago I did a study and found that > 90% of Sun
> customers use < 3.3 GBytes for boot and basic OS services.
> So, another question is what to do with the other 495+ GBytes
> on the disk? A second pool has some merit.

[495G of swap? Just kidding!]

This just confirms that I must be a weird customer.

% df -h /
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c1t0d0s0       28G    21G   6.3G    78%    /

I actually broke my own habits here by splitting the 72G disk into two - the other half was a zfs pool for most of its life. So most of my "data" is on the other slice. Even a reasonably minimal server running S9 I just built uses 7.2G.

Simplicity always wins. For a long time now (not long enough - there was some pain before I switched) I've just gone for a single large /. On desktops that's it. On servers with 2 disks, mirror it. Every time it gets split up, you lose. (Which is one reason why zfs is such a great thing, of course!)

On systems with more drives, I always split data onto the other drives. (On a 2-disk system I can't, obviously.) That way I can reinstall and do whatever I want with the boot disk without affecting data. Or I can pull the data drives and drop them into another box.

But you shouldn't listen to me. Because whatever you do - no matter how good the preinstalled image is - I will overwrite it from my jumpstart server. I suspect the active community is the wrong place to ask.

--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Robert Milkowski
2006-Mar-24 10:57 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
Hello Peter,

I wonder how many customers put the pre-installed Solaris into production, and how many of them put on their own Solaris (I mean jumpstart, etc.). The first thing I do when a new server arrives is install fresh Solaris using jumpstart; I do not bother at all about the pre-installed system.

Now when it comes to partitioning - in most systems I've got separate /var, /opt, and /, with 2GB for root at most, a minimum of 2GB for /var, and the rest of the local disks for /opt. All these partitions are mirrored (SVM).

In case of ZFS I would probably do something similar, but in one pool. For example:

  boot/root   /
  boot/var    /var
  boot/opt    /opt

where boot is a mirrored zpool. Then I would make a separate zpool for the other disks.

--
Best regards,
Robert                  mailto:rmilkowski at task.gda.pl
                        http://milek.blogspot.com
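A sketch of the layout Robert describes, with placeholder device names; the / and /var mountpoints of course presuppose zfs boot support:

    zpool create boot mirror c0t0d0s0 c0t1d0s0
    zfs create boot/root      # would become / once zfs boot arrives
    zfs create boot/var       # likewise /var
    zfs create boot/opt
    zfs set mountpoint=/opt boot/opt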
Roch Bourbonnais - Performance Engineering
2006-Mar-24 11:40 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
> > As a former customer, I'd recommend:
> >
> > - Configure swap as s0, with 2 x memory up until 8gb. If you've 4gb+
> >   memory and you need that much swap, you're going to want to add
> >   additional swap disks for better performance.
>
> Disagree. If you have to swap, your performance stinks, period.
> Since swap is the default dumpdev, configure swap to handle
> the expected dump. Beyond that, it is a waste of space. Indeed
> most systems today do not use swap space at all during normal
> operation.

Disagree. I think you mean: if you have to swap in and out, your performance will stink. But it still feels that swap can be useful, to swap out stuff while the owner has left for an extended vacation.

-r
Gregory Shaw
2006-Mar-24 13:51 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
On Mar 23, 2006, at 11:05 PM, Richard Elling wrote:

>> As a former customer, I'd recommend:
>>
>> - Configure swap as s0, with 2 x memory up until 8gb. If you've 4gb+
>>   memory and you need that much swap, you're going to want to add
>>   additional swap disks for better performance.
>
> Disagree. If you have to swap, your performance stinks, period.
> Since swap is the default dumpdev, configure swap to handle
> the expected dump. Beyond that, it is a waste of space. Indeed
> most systems today do not use swap space at all during normal
> operation.

I'm sorry I wasn't clear on swap. I totally agree; swap is there for crash dumps. If you get into swap, you've sized your server incorrectly. That's why I don't like to waste disk space on swap.

>> - Configure the rest of the initial system disk as a pool.
>> - Leave the rest alone, with the idea that the customer can allocate
>>   it as needed.
>
> We've done that in the past. What we find is that the space is
> never allocated. Think about it. What would happen if my
> grandmother got a name-brand PC where half of the disk was
> not allocated? Would she ever figure it out, or would she call
> customer service and complain that she ordered a 400 GByte
> drive and only got a 200 GByte drive? Since answering calls
> costs money, why would a vendor do that?

If the customer isn't technical enough to run the system, whatever you do won't be sufficient. Sun servers and workstations aren't generally for your mother. Buy your mother a Mac and be done with it. :-)

>> - As a customer, I'd then configure the 2nd disk (preferably on
>>   another controller) as a mirror. The rest would exist as another
>>   pool to allow for system upgrades (and consequent root disk pool
>>   rebuilds) without impacting the 'data disks'.
>
> In the old days, when we used bus-based disk interconnects, the
> worry was that a hung controller or bus would take out all drives.
> Today, with SAS and SATA as point-to-point disk interconnects,
> this failure mode is eliminated. The controllers are highly integrated
> and reliable, much more reliable than the disks (by one or more
> orders of magnitude). So expect that you will have one controller with
> several fault-isolated disks. Also, expect that they won't be disks at
> all (and have different failure modes).

Perhaps that's a recent change. I haven't experienced SAS disks. However, a number of current systems (V40z, V490, etc.) ship with a single bus-based SCSI-SCA interface. Hanging disks still take them out.

>> It's rather similar to the Veritas VxVM idea of 'rootdg' for the root
>> disk(s) and another disk group for the remaining storage.
>
> Yes. I think this is what Lori is alluding to: special restrictions on
> the boot pool may lead to a practice of using a separate pool for
> disks beyond the boot disks.
>
> A few years ago I did a study and found that > 90% of Sun
> customers use < 3.3 GBytes for boot and basic OS services.
> So, another question is what to do with the other 495+ GBytes
> on the disk? A second pool has some merit.

Actually, here are my feelings on the future: with the availability of large flash storage (see http://www.geek.com/news/geeknews/2006Mar/bpd20060323035442.htm), the system disk should be converted to boot from flash storage. Since flash isn't that fast, the system should use something like cachefs, and cache as much of the OS in memory as possible. That would allow patches and such to go against a copy of the OS in a duplicated flash disk.

Live patches, a live backup copy, and high performance with no moving parts... I like that.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273   Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
Gregory Shaw
2006-Mar-24 14:07 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
I've done the same thing -- when a server arrives, it could boot Windows for all I care. Since it's possible to re-image the OS easily and quickly using your own parameters, you don't lose anything, and tend to gain a lot by reloading the OS.

On HP(-UX) servers, however, there is a big difference. HP still lives in the Bad Old Days(tm) and requires codewords for software licenses. They also pre-load the OS with everything that you've purchased. It's entirely possible that you could lose a codeword and not be able to put your OS back together the same way it was. In that case, you generally leave the system disks alone when the system arrives.

Personally, I don't agree with the way the OS arrives on Sun servers. So, I reload it. I agree with the config below, with the exception that I use an 8g root slice. I've run into situations where the $*#(*^*% software vendor *has* to load something in /usr/lib. That can fill up your root disk and make patching difficult. I also like /var separate, generally on a separate set of disks in many cases.

On Mar 24, 2006, at 3:57 AM, Robert Milkowski wrote:

> In case of ZFS I would probably do something similar, but in one pool.
> For example:
>
>   boot/root   /
>   boot/var    /var
>   boot/opt    /opt
>
> where boot is a mirrored zpool. Then I would make a separate zpool
> for the other disks.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273   Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
Richard Elling <Richard.Elling at Sun.com> writes:

> In other words, there is significant variance between systems because
> once the allocations are made, they are difficult to change (UFS) and
> may require reinstallation of the OS.

My current plan for a couple of installations next week (both S10 U2 Beta with ZFS, and SX) is as follows:

  slice 0   most of the disk for a zpool (potentially mirrored)
  slice 1   swap, as large as necessary; necessary since you cannot yet
            dump to a zvol (when is this expected to change, btw?)
  slice 3   ufs /, mounted -o nodevices (works since in S10 and up
            /devices is devfs)
  slice 4   ufs /var, mounted -o nodevices,nosuid
  slice 5   ufs / (Live Upgrade ABE)
  slice 6   ufs /var (Live Upgrade ABE)
  slice 7   metadb for SVM mirroring of /, /var

On single-disk x86 machines, I'd like to add two more /, /usr ufs slices for further ABEs, which is possible with the 16-slice limit in the x86 VTOC, but you probably cannot create them from JumpStart (or format), which is ugly. It works with fmthard, though. I'll probably create slice 6 larger than necessary to accommodate those two additional pairs of slices and fix this up later.

This way, once zfs boot comes along, you can easily destroy slices 1-7, grow slice 0, and be done with it. I've actually done something like this when upgrading my laptop from SVM with soft partitions to zfs. The only caveat was that I had to boot the failsafe environment to do so: I couldn't grow a slice in use by a zpool, although this would have been safe ;-(

Btw, has anyone tried to create zpools and zfs file systems during custom jumpstart yet? I'll try this once I get a chance, but any success stories or caveats would be useful. I hope to work on a JetZFS module to automate this.

	Rainer

-----------------------------------------------------------------------------
Rainer Orth, Faculty of Technology, Bielefeld University
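On the JumpStart question, an untested sketch of a finish-script fragment, assuming the install miniroot carries the ZFS utilities; the pool, dataset, and device names are placeholders:

    # create a pool on a non-boot disk with an alternate root of /a,
    # so its mountpoints don't collide with the install environment
    zpool create -f -R /a tank c0t1d0
    zfs create tank/export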
"Ellis, Mike" <Mike.Ellis at fmr.com> writes:> 2. After ZFS is a bootable file system: > > Just put all disks in a big zpool. > > ** Hmm... Not ALL disks.... Say on a 4-disk-capable system like a > t2000.... Does it make sense to Mirror the boot disks, say the > "boot_zpool" (mirrored), and then have a DATA_zpool, or something along > those lines for the other 2 disks. (also mirrored)? The thought would > then be that if ALL your t2000 systems would come with the same > "boot_pool" all configured the same way. And if someone orders the > additional 2 disks, those get dropped into a data_zpool or something). > > Having said that, I like the appeal of a single zpool, (raid-Z) across > all 4 disks.... There really shouldn''t be a reason of splitting off the > data-disks, since you can''t export them anyway. (they''re internal) and I > could see that as the only/main reason to split the boot-pool from the > data-pool...) > > Ok, I''m convinced... Stick all internal disks in a RAIDZ pool, and carve > up the boot volumes etc. as needed. (this has SO MUCH potential when it > comes to live-upgrade, whole-system snapshots before application > upgrades, patching, etc. etc.I don''t like this: given a system with a couple of internal disks (say a V240 with 4 disks), splitting them into (say) two zpools makes sense to me: consider a 2-disk mirrored zpool for root (and any other O/S file systems), and another (mirrored or striped) zpool for data. If the machine breaks for some reason, it might be useful to be able to extract just the two disks with the data zpool on them, move them to a different machine with free disk slots, import them and be done with it. If you have only one 4-disk zpool, you cannot do that. Besides, you may want different failure characteristics for the boot and data zpools: I''d like my boot disk mirrored on production systems, but I can imagine going for a striped data pool where all I care about is performance. I even thought about doing something similar on my 2-disk Blade 1500: have two slices in a mirrored zpool for important (system or other) filesystems, and stripe across two different slices for e.g. OpenSolaris builds. Rainer -- ----------------------------------------------------------------------------- Rainer Orth, Faculty of Technology, Bielefeld University
David Robinson
2006-Mar-24 16:03 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
Richard Elling wrote:

>> As a former customer, I'd recommend:
>>
>> - Configure swap as s0, with 2 x memory up until 8gb. If you've 4gb+
>>   memory and you need that much swap, you're going to want to add
>>   additional swap disks for better performance.
>
> Disagree. If you have to swap, your performance stinks, period.
> Since swap is the default dumpdev, configure swap to handle
> the expected dump. Beyond that, it is a waste of space. Indeed
> most systems today do not use swap space at all during normal
> operation.

Obviously you don't run freeware with all the memory leaks they have! :-)

Having a dump partition that is big enough to hold all the crash bits is all that is really needed. If you really need more swap space (leaky apps), just create a swap file on your root pool. There is no harm in swapping to a file, and it is probably even better to swap to a zvol. Because ZFS is fundamentally dynamic around sizes, the old problem of not having enough space on the "right" partition is gone. You can add them and delete them on a running system with no harm.

-David
There is a lot of input here that needs to be considered, but let me call out a couple of recommendations from the zfs group:

1. Consider separating the root pool from the data pool or pools. As Jeff Bonwick said, there are good reasons to keep the "personality" of a system separate from data. You will often want data to be sharable with other systems, but you will seldom want this for the system software. Also, there are likely to be some restrictions (at least initially) on the root pool that would be onerous for data pools.

2. Make the root pool big enough for several, or even many, boot environments (i.e. root file systems). One of the great things about having zfs as root will be the flexibility of using snapshots and liveupgrade to keep several boot environments around without having to preallocate partitions for them. Want to test a patch? Use lucreate to clone the current boot environment (which will be very fast and initially use almost no additional space), patch the new BE, try it out. Don't like it? Your original BE is still around.

I might think of some more general recommendations, but that's all that comes to mind right now.

Lori
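A sketch of that patch-test flow using Live Upgrade, with hypothetical BE and patch names:

    lucreate -n patch-test                               # clone the current BE
    luupgrade -t -n patch-test -s /var/tmp/patches 123456-01
    luactivate patch-test                                # like it? make it active
    init 6                                               # boot into the patched BE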
Karl Rossing
2006-Mar-24 19:28 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
I have been doing /var = swap + 4GB.

I now SEE how there is a lot of thought behind Solaris being pre-installed on new hardware. I guess I'll feel bad the next time I re-install Solaris after a new server is un-boxed.

Karl

PS: 1st post!
On 3/24/06, Lori Alt <Lori.Alt at sun.com> wrote:
> There is a lot of input here that needs to be considered,
> but let me call out a couple of recommendations from the
> zfs group:
>
> 1. Consider separating the root pool from the data
>    pool or pools. As Jeff Bonwick said, there are good
>    reasons to keep the "personality" of a system
>    separate from data. You will often want data to be
>    sharable with other systems, but you will seldom want
>    this for the system software. Also, there are likely
>    to be some restrictions (at least initially) on the
>    root pool that would be onerous for data pools.

While I agree that the "personality" should go with the system, I also want to have a master root/boot image and have most/all of my systems use a "personal" clone of it, rather than an entire boot disk/filesystem of their own. I know that ZFS cannot address this today, but it could be an interesting feature for the roadmap.

--
Regards,
	Cyril
Cyril Plisko wrote:

> While I agree that the "personality" should go with the system,
> I also want to have a master root/boot image and have
> most/all of my systems use a "personal" clone of it, rather
> than an entire boot disk/filesystem of their own. I know that ZFS
> cannot address this today, but it could be an interesting
> feature for the roadmap.

It will be possible to produce a zfs root file system from a jumpstart profile, and also from a flash archive. Would that achieve your goal? What beyond that would you like ZFS to provide?

Lori
On 3/24/06, Lori Alt <Lori.Alt at sun.com> wrote:
> It will be possible to produce a zfs root file
> system from a jumpstart profile, and also from a
> flash archive. Would that achieve your goal?

That is not the same. In this case, if you have, say, 100 machines, you will use 100 x 3 GB (assuming your root image is 3 GB). What I meant is different: I have a master image of 3 GB and create a ZFS clone for each machine. A ZFS clone consumes only as much disk space as the files that are _different_ for each machine take; the vast majority of files are the same for every machine and thus shared. Of course it requires shared disks (SAN/iSCSI) and a kind of cluster-aware ZFS.

> What beyond that would you like ZFS to provide?

--
Regards,
	Cyril
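The cloning primitive itself exists today, even though booting many hosts from shared clones does not; a sketch with placeholder names:

    zfs snapshot images/master@golden
    zfs clone images/master@golden images/host01    # near-zero space initially
    zfs clone images/master@golden images/host02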
Dennis J. Behrens
2006-Apr-10 12:05 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
I imagine this reply is a tad late, but heck, I felt the need to reply to this thread.

I suspect that most knowledgeable Sun shops will take the box, rack mount it, and then jumpstart the system; these are typically the Sun customers who will actually mirror their root disks and worry about disk layout.

What Sun should worry about is the layout for the other Sun customers who don't jumpstart their systems. I've installed systems at some customer sites where that Sun box is the only one in their datacenter. Those customers want to be able to plug in the system and go. This is where the pre-installed image will be used.

Having said that, I believe the current EIS standards are fine for the pre-zfs world. The only major change I would suggest is putting /export/home in a zvol. Then once ZFS can boot, set up two disks in a zpool and mirror them - to at least work with the firmware to be able to boot the systems. Then put the rest of the disks attached to the system into a zpool and toss /export/home in that zpool.

In both configs just put in swap large enough to handle the crashdump, which is fine for most systems, other than Oracle servers or SAP installations. But typically those environments have some admin on staff who is competent to add more swap. And I'd think that putting the extra swap in a file within zfs would be as good if not better than a swap partition.

I suppose this isn't hard numbers or anything, but I can't see putting hard numbers down for specific file systems, due to the variation in systems out there. For example, enough swap for a 25K domain would likely be far too much for an x86 system, or a V120.
On 4/10/06, Dennis J. Behrens <dbehrens at second-2-none.net> wrote:
> In both configs just put in swap large enough to handle the crashdump,
> which is fine for most systems, other than Oracle servers or SAP
> installations. But typically those environments have some admin on
> staff who is competent to add more swap. And I'd think that putting
> the extra swap in a file within zfs would be as good if not better
> than a swap partition.

Why would you suggest that swap should be larger for Oracle? Are you suggesting that it is more efficient to cause Oracle to page to swap (let the OS guess what it is doing) rather than tuning Oracle to only use RAM and page to its data files per its algorithms?

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
Dennis J. Behrens
2006-Apr-10 12:57 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
On Mon, 10 Apr 2006, Mike Gerdts wrote:

> Why would you suggest that swap should be larger for Oracle? Are you
> suggesting that it is more efficient to cause Oracle to page to swap
> (let the OS guess what it is doing) rather than tuning Oracle to only
> use RAM and page to its data files per its algorithms?
>
> Mike

I wasn't intending to suggest that it's better to have larger swap space for performance reasons on Oracle servers. It's just more of an observation that I have encountered at many customer sites. Obviously, for optimal performance you want to tune Oracle to keep itself in RAM rather than being paged out to swap.

--Dennis

--
Dennis J. Behrens
dbehrens at second-2-none.net
"If you insist on using Windoze you're on your own."
"I sense much windows in you. Windows leads to bluescreens, bluescreens lead to crashing, crashing leads to... suffering."
Gregory Shaw
2006-Apr-10 13:38 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
Oracle requires a lot of swap due to user processes. Somebody correct me if I don't say this right, but when user processes start, Oracle reserves space in swap for the process. Since these can be large processes, it can eat a lot of swap.

Note that Oracle doesn't actually *use* the swap unless needed. It's just a reserve. I think it would use it if critically out of memory. You've got other problems if your Oracle box is that memory constrained. However, if you run out of swap reserve capacity, it's a Bad Thing(tm).

I'm sure someone has a more detailed technical answer, but that's what I've seen on my big Oracle boxes.

On Apr 10, 2006, at 6:43 AM, Mike Gerdts wrote:

> Why would you suggest that swap should be larger for Oracle? Are you
> suggesting that it is more efficient to cause Oracle to page to swap
> (let the OS guess what it is doing) rather than tuning Oracle to only
> use RAM and page to its data files per its algorithms?

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273   Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
On 4/10/06, Dennis J. Behrens <dbehrens at second-2-none.net> wrote:
> I wasn't intending to suggest that it's better to have larger swap
> space for performance reasons on Oracle servers. It's just more
> of an observation that I have encountered at many customer sites.
> Obviously, for optimal performance you want to tune Oracle to keep
> itself in RAM rather than being paged out to swap.

Gotcha. I keep seeing people say things like this and I cannot figure out why. When I do get an answer, it is normally because people are using RAM-to-swap ratios that came about when the cost of RAM was measured in dollars per megabyte.

FWIW, on large systems I tend to size swap based upon the space required for a crash dump. When Sun, Oracle, and Veritas have reviewed these configurations, they have raised no red flags.

Mike
Darren J Moffat
2006-Apr-10 14:30 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
Mike Gerdts wrote:

> FWIW, on large systems I tend to size swap based upon the space
> required for a crash dump. When Sun, Oracle, and Veritas have reviewed
> these configurations, they have raised no red flags.

An alternative thing to do in those cases is not to have any swap at all and use dumpadm(1m) to configure that disk area as a dedicated dump device.

--
Darren J Moffat
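A sketch of that configuration, with a placeholder device name:

    swap -d /dev/dsk/c0t0d0s1       # stop swapping on the slice, if it was swap
    dumpadm -d /dev/dsk/c0t0d0s1    # make it a dedicated dump device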
Casper.Dik at Sun.COM
2006-Apr-10 14:33 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
> > I wasn't intending to suggest that it's better to have larger swap
> > space for performance reasons on Oracle servers. It's just more
> > of an observation that I have encountered at many customer sites.
> > Obviously, for optimal performance you want to tune Oracle to keep
> > itself in RAM rather than being paged out to swap.
> >
> > --Dennis
>
> Gotcha. I keep seeing people say things like this and I cannot figure
> out why. When I do get an answer, it is normally because people are
> using RAM-to-swap ratios that came about when the cost of RAM was
> measured in dollars per megabyte.

It's perhaps also a holdover from the days of Solaris 2.6, when ISM allocations counted towards both physical memory and virtual memory. (So they in effect counted against virtual memory *twice*, because the physical memory disappeared.) As a consequence, Solaris 2.6 required huge amounts of unusable swap to allow for large ISM segments.

Casper
Richard Elling
2006-Apr-10 16:48 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
On Mon, 2006-04-10 at 05:05 -0700, Dennis J. Behrens wrote:
> I imagine this reply is a tad late, but heck, I felt the need to reply
> to this thread.

Not late, keep it coming... I've got a draft of a summary almost ready.

> I suspect that most knowledgeable Sun shops will take the box, rack
> mount it, and then jumpstart the system; these are typically the Sun
> customers who will actually mirror their root disks and worry about
> disk layout.

Yes. For some reason, some sys admins believe that knowing how to slice up a disk will ensure their continued employment. I see such complexity as a bug^H^H design deficiency.

> What Sun should worry about is the layout for the other Sun customers
> who don't jumpstart their systems. I've installed systems at some
> customer sites where that Sun box is the only one in their datacenter.
> Those customers want to be able to plug in the system and go. This is
> where the pre-installed image will be used.

Yes.

> Having said that, I believe the current EIS standards are fine for the
> pre-zfs world. The only major change I would suggest is putting
> /export/home in a zvol.

The problem with the EIS boot disk standard is that it has an escape clause which is basically, "or do whatever the customer wants."

The personal problem I have with the EIS boot disk standard is that it still advocates multiple file systems for the basic OS services. This does not offer any real protection against any DoS, but does complicate backup/restore and therefore increases your TCO (not a good thing). Since no other volume OSes have this affliction, it makes sense to get rid of it for Solaris, too. Sometimes we hurt ourselves with good intentions.

> Then once ZFS can boot, set up two disks in a zpool and mirror them -
> to at least work with the firmware to be able to boot the systems.
> Then put the rest of the disks attached to the system into a zpool and
> toss /export/home in that zpool.

I think the prevailing wind will go to hardware RAID for boot disks, at least in the server space. But in any case, I think ZFS mirroring will be much simpler to implement and manage than LVM-based mirroring of the boot disk.

 -- richard
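For comparison, attaching a mirror to an existing ZFS pool is a single command (placeholder names; a boot pool would additionally need the boot block installed on the new disk):

    zpool attach rootpool c0t0d0s0 c0t1d0s0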
Gregory Shaw
2006-Apr-10 17:34 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
If I read the below right, other volume OSes use one big partition? That's not true for Linux, HP-UX, *BSD, and probably others.

ZFS changes the picture significantly, as it doesn't 'think' in filesystems any longer. Is there a reason for ZFS to be broken into smaller pieces? I'm expecting to have a UFS boot slice (something small that the kernel bootloader will read) and everything else ZFS.

The only reason I could see to have discrete ZFS pools would be a 'crown jewels' isolation where something needs to be separate from the rest of the data store. Does anybody know of a solution that would require that separation?

On Apr 10, 2006, at 10:48 AM, Richard Elling wrote:

> The personal problem I have with the EIS boot disk standard is that it
> still advocates multiple file systems for the basic OS services. This
> does not offer any real protection against any DoS, but does
> complicate backup/restore and therefore increases your TCO (not a good
> thing). Since no other volume OSes have this affliction, it makes
> sense to get rid of it for Solaris, too. Sometimes we hurt ourselves
> with good intentions.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273   Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382        greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
Gregory Shaw wrote:> ZFS changes the picture significantly, as it doesn''t ''think'' in > filesystems any longer. Is there a reason for ZFS to be broken into > smaller pieces? I''m expecting to have a UFS boot slice (something > small that the kernel bootloader will read) and everything else ZFS.Actually, we''re hoping not to need a UFS boot slice. Once we have a zfs GRUB plugin, the kernel bootloader should be able to read what it needs directly from the zfs root pool. The ZFS boot prototype uses a ufs boot slice, but we intend that to be temporary.> > The only reason I could see to have discrete ZFS pools would be a > ''crown jewels'' isolation where something needs to be separate from the > rest of the data store. Does anybody know of a solution that would > require that separation? >Some of this repeats earlier parts of the discussion, but here''s a quick summary: 1. There are good reasons to split root pools from data pools: - It isn''t reasonable to share root pools between sparc and x86 systems, but it IS reasonable to share data pools. - There will be some restrictions (at least initially) on root pools that won''t apply to data pools. - no concatenation or RAID-Z allowed for root pools. Only mirroring. - All devices in a root pool must be accessible from the firmware. - There are advantages to keeping a system''s "personality" separate from the system''s data. But this may not what you meant. Perhaps you were thinking about whether there are reasons to have separate file system for root, /usr, /var, etc. So... 2. ZFS has an interesting property. It makes splitting the file name space into separate file systems less NECESSARY, but it also makes it so easy and cheap that if there are ANY good reasons for splitting the name space into separate file systems, you might as well because there''s no good reason not to do it. The main reason I can see for possibly splitting the filename space into separate file systems is to better support our various kinds of environment virtualization. I''m thinking of (a) zones, and (b) live upgrade boot environments. I believe that zones don''t really require separate file systems in order to share, say, /usr between the global zone and the local zones, but perhaps someone more well-versed in zones can say whether there''s anything to be gained for zones by splitting the namespace into separate file systems along certain fault lines. Liveupgrade, however, is more able to share parts of the name space between boot environments if those parts of the name space are in separate file systems. So there might be some advantages in maintaining multiple BEs (boot environments) if there was some splitting of the name space into separate file systems. For example, you would probably want /export/home to be a separate file system so that it can be shared between BEs. Lori> > On Apr 10, 2006, at 10:48 AM, Richard Elling wrote: > >> On Mon, 2006-04-10 at 05:05 -0700, Dennis J. Behrens wrote: >> >>> I imagine this reply is a tad late, but heck I felt the need to >>> reply to this thread. >> >> >> Not late, keep it coming... I''ve got a draft of a summary almost ready. >> >>> While I suspect that most knowledgeable Sun shops will take the box, >>> rack mount it, and then jumpstart the system. These are typically >>> the Sun customers who will actually mirror their rootdisks and worry >>> about disk layout. >> >> >> Yes. For some reason, some sys admins believe that knowing how >> to slice up a disk will ensure their continued employment. 
The only reason I ever use more than / when installing an OS is because of performance or security. I like to be able to set noatime or noexec on some slices. If it was something I could set at a directory level, I would never use more than / again. Having to slice up a disk is a design flaw.

-Sean

On Apr 10, 2006, at 10:34 AM, Gregory Shaw wrote:

> If I read the below right, other volume OSes use one big partition?
>
> That's not true for Linux and HP-UX, *BSD and probably others.
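For what it's worth, the per-slice mount options Sean mentions map onto per-dataset ZFS properties, so one pool can still give that behaviour at file system granularity. A quick sketch (the dataset names are just examples):

    zfs set atime=off   tank/scratch     # the old noatime mount option
    zfs set exec=off    tank/incoming    # the old noexec mount option
    zfs set setuid=off  tank/incoming    # likewise nosuid
    zfs set devices=off tank/incoming    # and nodevices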
Richard Elling
2006-Apr-10 21:02 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
On Mon, 2006-04-10 at 11:34 -0600, Gregory Shaw wrote:
> If I read the below right, other volume OSes use one big partition?
>
> That's not true for Linux and HP-UX, *BSD and probably others.

Microsoft Windows and Mac OS X are the big volume OSes. I've only installed a couple of Linux distros, and all of them showed one file system. HP-UX is definitely not a volume OS. I suppose Mac OS X is the volume distro for *BSD.
 -- richard
On Apr 10, 2006, at 17:02, Richard Elling wrote:
> Microsoft Windows and Mac OS X are the big volume OSes. I've only installed a couple of Linux distros, and all of them showed one file system. HP-UX is definitely not a volume OS. I suppose Mac OS X is the volume distro for *BSD.

Under Windows (DOS) you have the 'namespaces' of drive letters. The OS X file system is taken care of by the Darwin / BSD layer, so you can mount drives and volumes anywhere; Apple simply chooses to use one slice for things by default.
Sanjay Nadkarni
2006-Apr-11 01:11 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
I often assign fairly large swap partitions simply because some of my applications make heavy use of /tmp (tmpfs). Is this a possible consideration?

-Sanjay

Dennis J. Behrens wrote:
> On Mon, 10 Apr 2006, Mike Gerdts wrote:
>> On 4/10/06, Dennis J. Behrens <dbehrens at second-2-none.net> wrote:
>>> In both configs just put swap in as large enough to handle the crashdump, which is fine for most systems, other than Oracle servers or SAP installations. But typically those environments have some admin on staff who is competent to add more swap. And I'd think that putting the extra swap in a file within zfs would be as good if not better than a swap partition.
>>
>> Why would you suggest that swap should be larger for Oracle? Are you suggesting that it is more efficient to cause Oracle to page to swap (let the OS guess what it is doing) rather than tuning Oracle to only use RAM and page to its data files per its algorithms?
>>
>> Mike
>
> I wasn't intending to suggest that it's better to have larger swap space for performance reasons for Oracle servers. It's just more of an observation that I have encountered at many customer sites. Obviously for optimal performance you want to tune Oracle to keep itself in RAM rather than being paged out to swap.
>
> --Dennis
On 4/10/06, Darren J Moffat <Darren.Moffat at sun.com> wrote:
> Mike Gerdts wrote:
>> FWIW, on large systems I tend to size swap based upon the space required for a crash dump. When Sun, Oracle, and Veritas have reviewed these configurations, they have raised no red flags.
>
> An alternative thing to do in those cases is not to have any swap at all and use dumpadm(1m) to configure that disk area as a dedicated dump device.

Actually, I do also allocate a dedicated dump device on 15k's and 25k's, primarily to improve reboot performance. It takes a long time to read 2 - 6 GB from swap and write 6 - 17 GB to /var, particularly if they are on the same spindles.

Keeping some swap around is useful because when an app allocates but does not touch memory, a reservation against swap is taken. Once the app touches the memory, a free page of RAM is found for it. Consider the following code:

    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
            malloc(1024 * 1024 * 50);   /* reserve 50 MB against swap, never touched */
            sleep(300);
            return (0);
    }

When it is run, if you take a look at the heap you will see that the RSS for the heap is not that big:

    $ pmap -x 20512
    20512:  ./malloc
     Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
    00010000       8       8       -       - r-x--  malloc
    00020000       8       8       8       - rwx--  malloc
    00022000   51208      16      16       - rwx--    [ heap ]
    FF280000     688     688       -       - r-x--  libc.so.1
    FF33C000      32      32      32       - rwx--  libc.so.1
    FF380000       8       8       8       - rwx--    [ anon ]
    FF390000       8       8       -       - r-x--  libc_psr.so.1
    FF39A000       8       8       8       - rwx--  libdl.so.1
    FF3A0000       8       8       -       - r--s-  dev:136,0 ino:50937
    FF3B0000     184     184       -       - r-x--  ld.so.1
    FF3EE000       8       8       8       - rwx--  ld.so.1
    FF3F0000       8       8       8       - rwx--  ld.so.1
    FFBFC000      16      16      16       - rwx--    [ stack ]
    --------  ------  ------  ------  ------
    total Kb   52192    1000     104       -

Thus, for programs that allocate a big chunk of RAM but are slow to write to it, you may see improved startup time, responsiveness, etc. if it doesn't have to cause a bunch of paging activity (page outs, page scanning, etc.) for all of the required pages.

Or at least that is my understanding of why you want swap even if you never intend to touch it. I welcome corrections.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
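For reference, the dedicated dump device Darren suggests is just a couple of dumpadm invocations; the slice name below is only an example:

    # show the current dump configuration
    dumpadm

    # send crash dumps to a dedicated slice instead of the swap slice
    dumpadm -d /dev/dsk/c0t0d0s1

    # kernel-only dumps keep the dump (and the savecore run) smaller
    dumpadm -c kernel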
On 3/23/06, Richard Elling <Richard.Elling at sun.com> wrote:
> We are preparing for the day when ZFS will be preinstalled on systems at Sun's factory. This design will occur in two phases: before and after ZFS is a bootable file system. We're interested in feedback from the community on how to allocate the disk resources.

I would suggest that we create a small / + swap on the drive, possibly 9 or 18 GB. If two or more disks are installed we mirror this space. When the customer receives the machine, on first boot he is prompted what to do with the remaining space:

    Your machine has XXX GB of space unallocated on it, would you like
    to add that to a new or existing zfs pool?
      1. add to the primary OS pool
      2. add to a secondary pool
      3. leave unallocated
    Your choice (1-3):

If the user chooses 1 or 2, then prompt with the following (raidz will only be presented if 3 or more drives are present):

    Please choose the level of data protection desired:
      1. raid 0 (no redundancy): if a drive fails your data is lost if
         not backed up
      2. mirror (raid 1) the remaining storage: you will have XXX GB of
         available storage if this option is chosen
      3. raidz the remaining storage: you will have XXX GB of available
         storage if this option is chosen
    Your choice (1-3):

This scheme will probably break jumpstart scripts, but zfs will probably do that anyway.

James Dickens
uadmin.blogspot.com
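A rough sh sketch of that first-boot dialog, just to show the shape of it (the disk list, pool name, and zpool calls are illustrative, and the two prompts above are collapsed into one here):

    #!/bin/sh
    # Hypothetical first-boot helper for the disks left unallocated by the factory.
    DISKS="c0t2d0 c0t3d0 c0t4d0"   # would really be discovered, e.g. from format(1m)

    echo "Your machine has unallocated disks: $DISKS"
    echo "  1. mirror them into a data pool"
    echo "  2. raidz them into a data pool (3 or more disks)"
    echo "  3. leave them unallocated"
    printf "Your choice (1-3): "
    read choice

    case "$choice" in
    1)  zpool create datapool mirror $DISKS ;;
    2)  zpool create datapool raidz $DISKS ;;
    *)  echo "Leaving the disks unallocated." ;;
    esac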
Nicolas Williams
2006-Apr-11 05:17 UTC
[zfs-discuss] Preinstallation of ZFS at the factory
On Mon, Apr 10, 2006 at 11:54:41PM -0500, James Dickens wrote:
> I would suggest that we create a small / + swap on the drive, possibly 9 or 18 GB. If two or more disks are installed we mirror this space. When the customer receives the machine, on first boot he is prompted what to do with the remaining space.

Why not one big pool though? If the user has other storage then they won't care much about the factory root disks. If they don't then one big pool on the factory root disks will be just what they wanted anyways, with reservations and quotas being enough to manage that storage.

I'd rather see the install not ask anything about this. If you really don't want the pre-install deal, re-install.

Nico
--
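The reservations and quotas Nicolas mentions are one-liners per dataset; for example (names and sizes are made up):

    zfs create datapool/home
    zfs set quota=50G datapool/home          # home directories can never eat the whole pool
    zfs create datapool/dumps
    zfs set reservation=10G datapool/dumps   # this dataset is always guaranteed its space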
On 4/11/06, Nicolas Williams <Nicolas.Williams at sun.com> wrote:
> Why not one big pool though? If the user has other storage then they won't care much about the factory root disks. If they don't then one big pool on the factory root disks will be just what they wanted anyways, with reservations and quotas being enough to manage that storage.

Because most admins and bluebooks recommend that root be mirrored, and initially zfs won't allow / to be a raidz. If all filespace is mirrored, that forces the user to give up 50% of disk space when raidz is safe enough for some customers. How do we decide whether the customer wants raidz or mirror if we put all drives in one large pool? Giving them the choice makes their life easier.

Marketing hat on: this method also becomes a marketing tool, letting the customer know that their system is zfs-enabled out of the box. ZFS is already a feature everyone is waiting for, no one else has it, so we might as well brag about it. Marketing hat off.

James
Richard Elling
2006-Apr-11 05:31 UTC
[zfs-discuss] Re: Preinstallation of ZFS at the factory
On Mon, 2006-04-10 at 20:17 -0500, Mike Gerdts wrote:
> On 4/10/06, Darren J Moffat <Darren.Moffat at sun.com> wrote:
>> An alternative thing to do in those cases is not to have any swap at all and use dumpadm(1m) to configure that disk area as a dedicated dump device.
>
> Actually, I do also allocate a dedicated dump device on 15k's and 25k's, primarily to improve reboot performance. It takes a long time to read 2 - 6 GB from swap and write 6 - 17 GB to /var, particularly if they are on the same spindles.

Not needed in S10. Prior to S10, the crash dump collection is done serially during boot, via an rc-script I can't remember the name of anymore. In S10, it is now an SMF service and operates in parallel with the other services during boot. Though it does cause some I/O, the actual amount of I/O issued during boot is remarkably small.
 -- richard
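If you want to see how a given S10 box is wired up, the dump side is easy to inspect (output will of course vary):

    # current dump device, savecore directory, and whether savecore runs on boot
    dumpadm

    # find the SMF service(s) involved in crash dump handling
    svcs -a | grep -i dump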
Nicolas Williams
2006-Apr-11 05:34 UTC
[zfs-discuss] Preinstallation of ZFS at the factory
On Tue, Apr 11, 2006 at 12:30:23AM -0500, James Dickens wrote:
> Because most admins and bluebooks recommend that root be mirrored, and initially zfs won't allow / to be a raidz. If all filespace is mirrored, that forces the user to give up 50% of disk space when raidz is safe enough for some customers. How do we decide whether the customer wants raidz or mirror if we put all drives in one large pool? Giving them the choice makes their life easier.

We have to balance install simplicity and flexibility. If you'll have JBODs for RAID-Z then you probably won't care about "wasting" the root disk storage.

> Marketing hat on: this method also becomes a marketing tool, letting the customer know that their system is zfs-enabled out of the box. ZFS is already a feature everyone is waiting for, no one else has it, so we might as well brag about it. Marketing hat off.

Can't we do that without asking questions though?

Nico
--
On 4/11/06, Richard Elling <Richard.Elling at sun.com> wrote:
>> Actually, I do also allocate a dedicated dump device on 15k's and 25k's, primarily to improve reboot performance. It takes a long time to read 2 - 6 GB from swap and write 6 - 17 GB to /var, particularly if they are on the same spindles.
>
> Not needed in S10. Prior to S10, the crash dump collection is done serially during boot, via an rc-script I can't remember the name of anymore. In S10, it is now an SMF service and operates in parallel with the other services during boot. Though it does cause some I/O, the actual amount of I/O issued during boot is remarkably small.

Good point. I'll have to go look at the dependencies to see what it will slow down. That is, swap cannot safely be enabled before savecore is complete if dedicated dump devices are not used. Presumably enabling swap is part of one of the earlier milestones. Experience so far has been with Solaris 8 and Solaris 9 on these boxes - now I guess I have another thing to look into.

As for the amount of I/O... Crash dumps in the range of 5 GB to 10 GB are not terribly uncommon on sufficiently large busy servers. The largest I have seen was either 14 GB or 17 GB. Let's pretend that the boot disks can sustain writes at 20 MB/s while reading the compressed image from the same spindles. This means that for a 10 GB vmcore file, savecore will run for 8.5 minutes. Given that whoever is on the phone with you will be asking for updates every 30 seconds, that means that you hear "Are we there yet?" 17 times. :)

Now back to the previous discussion on ZFS...

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
Richard Elling
2006-Apr-13 17:37 UTC
[zfs-discuss] Re: Re: Preinstallation of ZFS at the factory
> I often assign fairly large swap partitions simply because some of my applications make heavy use of /tmp (tmpfs). Is this a possible consideration?

Sure, as long as you don't expect performance. If you need more performance, add more RAM.
 -- richard

This message posted from opensolaris.org
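If heavy /tmp (tmpfs) use is what drives the big swap slice, another knob is to cap tmpfs in /etc/vfstab so /tmp can never consume all of swap; the size below is only an example:

    # /etc/vfstab entry for /tmp
    swap    -       /tmp    tmpfs   -       yes     size=2048m

    # then watch actual swap/tmpfs consumption
    swap -s
    df -k /tmp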