Hi all,

We're working on replacing our current fileserver with something based on either Solaris or NexentaStor. We have about 200 users with variable needs. There will also be a few common areas for each department and perhaps a backup area. I think these should be separated with datasets, for simplicity and overview, but I'm not sure if it's a good idea.

I have read people are having problems with lengthy boot times with lots of datasets. We're planning to do extensive snapshotting on this system, so there might be close to a hundred snapshots per dataset, perhaps more. With 200 users and perhaps 10-20 shared department datasets, the number of filesystems, snapshots included, will be around 20k or more.

Will trying such a setup be betting on help from some god, or is it doable? The box we're planning to use will have 48 gigs of memory and about 1TB of L2ARC (shared with the SLOG; we just use some slices for that).

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
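For concreteness, a layout along these lines could be set up with something like the sketch below. The pool name ("tank"), mountpoints, user list file and department names are placeholders, not part of the actual plan:

    zfs create -o mountpoint=/export/home tank/home
    while read user; do
        zfs create "tank/home/$user"       # one dataset per user
    done < userlist.txt

    zfs create -o mountpoint=/export/dept tank/dept
    for d in sales engineering finance; do
        zfs create "tank/dept/$d"          # one dataset per department share
    done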
Roy Sigurd Karlsbakk wrote:

> I have read people are having problems with lengthy boot times with lots of
> datasets. We're planning to do extensive snapshotting on this system, so
> there might be close to a hundred snapshots per dataset, perhaps more. With
> 200 users and perhaps 10-20 shared department datasets, the number of
> filesystems, snapshots included, will be around 20k or more.

In my experience the boot time mainly depends on the number of datasets, not the number of snapshots. 200 datasets is fairly easy (we have >7000, but did some boot-time tuning).

> Will trying such a setup be betting on help from some god, or is it doable?
> The box we're planning to use will have 48 gigs of memory and about 1TB of
> L2ARC (shared with the SLOG; we just use some slices for that).

Try. The main problem with having many snapshots is the time used for zfs list, because it has to scrape all the information from disk, but with so much RAM/L2ARC that shouldn't be a problem here.

Another thing to consider is the frequency with which you plan to take the snapshots and whether you want individual schedules for each dataset. Taking a snapshot is a heavy-weight operation, as it terminates the current txg.

Btw, what did you plan to use as L2ARC/slog?

--Arne
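One way to keep the per-snapshot txg cost down is to snapshot recursively, since a recursive snapshot commits the snapshots of all descendant datasets in a single transaction group. A rough sketch; the pool name and naming scheme are assumptions:

    # one snapshot of every dataset in the pool, taken atomically
    zfs snapshot -r tank@auto-$(date +%Y-%m-%d-%H%M)

    # and later, remove an old one across all datasets in the same way
    zfs destroy -r tank@auto-2010-06-01-0000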
On Jun 20, 2010, at 11:55, Roy Sigurd Karlsbakk wrote:

> There will also be a few common areas for each department and perhaps a
> backup area.

The backup area should be on a different set of disks.

IMHO, a backup isn't a backup unless it is an /independent/ copy of the data. The copy can be made via ZFS send/recv, tar, rsync, Legato/NetBackup, etc., but it needs to be on independent media. Otherwise, if the original copy goes, so does the "backup".

> I have read people are having problems with lengthy boot times with lots of
> datasets. We're planning to do extensive snapshotting on this system, so
> there might be close to a hundred snapshots per dataset, perhaps more. With
> 200 users and perhaps 10-20 shared department datasets, the number of
> filesystems, snapshots included, will be around 20k or more.

You may also want to consider breaking things up into different pools as well. There seems to be an implicit assumption in this conversation that everything will be in one pool, and that may not be the best course of action.

Perhaps one pool for users' homedirs, and another for the departmental stuff? Or perhaps even two different pools for homedirs, with users 'randomly' distributed between the two (though definitely don't do something like alphabetical (it'll be non-even) or departmental (people transfer) distribution).

This could add a bit of overhead, but I don't think having 2 or 3 pools would be much more of a big deal than one.
On 06/21/10 03:55 AM, Roy Sigurd Karlsbakk wrote:

> Hi all
>
> We're working on replacing our current fileserver with something based on
> either Solaris or NexentaStor. We have about 200 users with variable needs.
> There will also be a few common areas for each department and perhaps a
> backup area. I think these should be separated with datasets, for
> simplicity and overview, but I'm not sure if it's a good idea.
>
> I have read people are having problems with lengthy boot times with lots of
> datasets. We're planning to do extensive snapshotting on this system, so
> there might be close to a hundred snapshots per dataset, perhaps more. With
> 200 users and perhaps 10-20 shared department datasets, the number of
> filesystems, snapshots included, will be around 20k or more.

200 user filesystems isn't too big. One of the systems I look after has about 1100 user filesystems with up to 20 snapshots each. The impact on boot time is minimal.

--
Ian.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> Will trying such a setup be betting on help from some god, or is it
> doable? The box we're planning to use will have 48 gigs of memory and

There's nothing difficult about it. Go ahead and test.

Personally, I don't see much value in using lots of separate filesystems. They're all in the same pool, right? I use one big filesystem.

There are legitimate specific reasons to use separate filesystems in some circumstances. But if you can't name one reason why it's better ... then it's not better for you.
On 21/06/10 12:58 PM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>>
>> Will trying such a setup be betting on help from some god, or is it
>> doable? The box we're planning to use will have 48 gigs of memory and
>
> There's nothing difficult about it. Go ahead and test.
>
> Personally, I don't see much value in using lots of separate filesystems.
> They're all in the same pool, right? I use one big filesystem.
>
> There are legitimate specific reasons to use separate filesystems in some
> circumstances. But if you can't name one reason why it's better ...
> then it's not better for you.

On the build systems that I maintain inside the firewall, we mandate one filesystem per user, which is a very great boon for system administration. My management scripts are considerably faster running when I don't have to traverse whole directory trees (a la ufs).

James C. McPherson
--
Senior Software Engineer, Solaris
Oracle
http://www.jmcp.homeunix.com/blog
----- Original Message -----

> On Jun 20, 2010, at 11:55, Roy Sigurd Karlsbakk wrote:
>
> > There will also be a few common areas for each department and
> > perhaps a backup area.
>
> The backup area should be on a different set of disks.
>
> IMHO, a backup isn't a backup unless it is an /independent/ copy of
> the data. The copy can be made via ZFS send/recv, tar, rsync,
> Legato/NetBackup, etc., but it needs to be on independent media.
> Otherwise, if the original copy goes, so does the "backup".

I think you misunderstand me here. The backup area will be a storage area for Ahsay (see http://www.ahsay.com/ ) for client and application backups (Oracle, Sybase, Exchange etc). All datasets will be copied to a secondary node, either with ZFS send/receive or (more probably) NexentaStor HA Cluster ( http://kurl.no/KzHU ).

> > I have read people are having problems with lengthy boot times with
> > lots of datasets. We're planning to do extensive snapshotting on
> > this system, so there might be close to a hundred snapshots per
> > dataset, perhaps more. With 200 users and perhaps 10-20 shared
> > department datasets, the number of filesystems, snapshots included,
> > will be around 20k or more.
>
> You may also want to consider breaking things up into different pools
> as well. There seems to be an implicit assumption in this conversation
> that everything will be in one pool, and that may not be the best
> course of action.
>
> Perhaps one pool for users' homedirs, and another for the departmental
> stuff? Or perhaps even two different pools for homedirs, with users
> 'randomly' distributed between the two (though definitely don't do
> something like alphabetical (it'll be non-even) or departmental
> (people transfer) distribution).
>
> This could add a bit of overhead, but I don't think having 2 or 3 pools
> would be much more of a big deal than one.

So far the plan is to keep it in one pool for design and administration simplicity. Why would you want to split up (net) 40TB into more pools? Seems to me that'll mess things up a bit, having to split up SSDs for use on different pools, losing the flexibility of a common pool etc. Why?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
> Btw, what did you plan to use as L2ARC/slog?

I was thinking of using four Crucial RealSSD 256GB SSDs, with a small RAID1+0 for the SLOG and the rest for L2ARC. The system will be mainly used for reads, so I don't think the SLOG needs will be too tough. If you have another suggestion, please tell :)

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
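If the SSDs are sliced as described (a small slice on each for the slog, the rest for L2ARC), the pool additions could look roughly like this; device names and slice numbers are invented, and the slices themselves would be created with format(1M) beforehand:

    # two mirrored log vdevs (striped, i.e. RAID1+0) from the small slices
    zpool add tank log mirror c2t0d0s0 c2t1d0s0 mirror c2t2d0s0 c2t3d0s0

    # the remaining large slices as cache devices (L2ARC is never mirrored)
    zpool add tank cache c2t0d0s1 c2t1d0s1 c2t2d0s1 c2t3d0s1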
On Jun 21, 2010, at 05:00, Roy Sigurd Karlsbakk wrote:

> So far the plan is to keep it in one pool for design and
> administration simplicity. Why would you want to split up (net) 40TB
> into more pools? Seems to me that'll mess things up a bit, having to
> split up SSDs for use on different pools, losing the flexibility of
> a common pool etc. Why?

If different groups or areas have different I/O characteristics, for one. If in one case (users) you want responsiveness, you could go with striped mirrors. However, if departments have lots of data, it may be worthwhile to put it on a RAID-Z pool for better storage efficiency.

Just a thought.
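As a rough illustration of the kind of split being suggested here (pool and device names are invented, and this is only a sketch), the two pools might be built along these lines:

    # striped mirrors for responsiveness on home directories
    zpool create homes mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0

    # RAID-Z2 for capacity on department data
    zpool create dept raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0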
----- Original Message -----

> On Jun 21, 2010, at 05:00, Roy Sigurd Karlsbakk wrote:
>
> > So far the plan is to keep it in one pool for design and
> > administration simplicity. Why would you want to split up (net) 40TB
> > into more pools? Seems to me that'll mess things up a bit, having to
> > split up SSDs for use on different pools, losing the flexibility of
> > a common pool etc. Why?
>
> If different groups or areas have different I/O characteristics, for
> one. If in one case (users) you want responsiveness, you could go with
> striped mirrors. However, if departments have lots of data, it may be
> worthwhile to put it on a RAID-Z pool for better storage efficiency.

We have considered RAID-1+0 and concluded that we have no current need for it. Close to 1TB of SSD cache will also help boost read speeds, so I think it will be sufficient, at least for now.

About the different I/O characteristics in different groups/areas: this is not something we have data on for now. Do you know a good way to check this? The data is located on two different zpools (sol10) today.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
David Magda wrote:

> On Jun 21, 2010, at 05:00, Roy Sigurd Karlsbakk wrote:
>
>> So far the plan is to keep it in one pool for design and
>> administration simplicity. Why would you want to split up (net) 40TB
>> into more pools? Seems to me that'll mess things up a bit, having to
>> split up SSDs for use on different pools, losing the flexibility of a
>> common pool etc. Why?
>
> If different groups or areas have different I/O characteristics, for one.
> If in one case (users) you want responsiveness, you could go with
> striped mirrors. However, if departments have lots of data, it may be
> worthwhile to put it on a RAID-Z pool for better storage efficiency.

Especially if the characteristics are different, I find it a good idea to mix all of them on one set of spindles. This way you have lots of spindles for fast access and lots of space for the sake of space. If you divide the available spindles into two sets, you will have much fewer spindles available for the responsiveness goal. I don't think taking them into a mirror can compensate for that.

--Arne
> From: James C. McPherson [mailto:jmcp at opensolaris.org]
>
> On the build systems that I maintain inside the firewall,
> we mandate one filesystem per user, which is a very great
> boon for system administration.

What's the reasoning behind it?

> My management scripts are
> considerably faster running when I don't have to traverse
> whole directory trees (a la ufs).

That's a good reason. Why would you have to traverse whole directory structures if you had a single zfs filesystem in a single zpool, instead of many zfs filesystems in a single zpool?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> Close to 1TB of SSD cache will also help boost read
> speeds,

Remember, this will not boost large sequential reads. (It could possibly even hurt them.) This will only boost random reads.
On 21/06/10 10:38 PM, Edward Ned Harvey wrote:

>> From: James C. McPherson [mailto:jmcp at opensolaris.org]
>>
>> On the build systems that I maintain inside the firewall,
>> we mandate one filesystem per user, which is a very great
>> boon for system administration.
>
> What's the reasoning behind it?

Politeness, basically. Every user on these machines is expected to make and use their own disk-space sandpit - having their own dataset makes that work nicely.

>> My management scripts are
>> considerably faster running when I don't have to traverse
>> whole directory trees (a la ufs).
>
> That's a good reason. Why would you have to traverse whole
> directory structures if you had a single zfs filesystem in
> a single zpool, instead of many zfs filesystems in a single zpool?

For instance, if I've got users a, b and c, who have their own datasets, and users z, y and x who do not:

    df -h /builds/[abczyx]

will show me disk usage of /builds for z, y and x, but of /builds/a, /builds/b and /builds/c for the ones who do have their own dataset. So when I'm trying to figure out who I need to yell at because they're using more than our acceptable limit (30GB), I have to run "du -s /builds/[zyx]". And that takes time. Lots of time. Especially on these systems, which are in huge demand from people all over Solaris-land.

James C. McPherson
--
Senior Software Engineer, Solaris
Oracle
http://www.jmcp.homeunix.com/blog
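With one dataset per user, the same question becomes a metadata lookup rather than a tree walk. Something along these lines (the dataset name tank/builds is assumed) lists per-user usage more or less instantly, biggest consumers first:

    zfs list -r -o name,used -S used tank/builds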
On 21/06/2010 13:59, James C. McPherson wrote:

> On 21/06/10 10:38 PM, Edward Ned Harvey wrote:
>>> From: James C. McPherson [mailto:jmcp at opensolaris.org]
>>>
>>> On the build systems that I maintain inside the firewall,
>>> we mandate one filesystem per user, which is a very great
>>> boon for system administration.
>>
>> What's the reasoning behind it?
>
> Politeness, basically. Every user on these machines is expected
> to make and use their own disk-space sandpit - having their own
> dataset makes that work nicely.

Plus it allows delegation of snapshot/clone/send/recv to the users on certain systems.

--
Darren J Moffat
On Mon, Jun 21, 2010 at 8:59 AM, James C. McPherson <jmcp at opensolaris.org> wrote:
[...]
> So when I'm
> trying to figure out who I need to yell at because they're
> using more than our acceptable limit (30GB), I have to run
> "du -s /builds/[zyx]". And that takes time. Lots of time.
[...]

Why not just use quotas?

fpsm
On Mon, 21 Jun 2010, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>>
>> Close to 1TB of SSD cache will also help boost read speeds,
>
> Remember, this will not boost large sequential reads. (It could
> possibly even hurt them.) This will only boost random reads.

Or more accurately, it boosts repeated reads. It won't help much in the case where data is accessed only once. It is basically a poor man's substitute for caching data in RAM. The RAM is at least 20X faster, so the system should be stuffed with RAM first, as long as the budget can afford it. Luckily, most servers experience mostly repeated reads.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, 21 Jun 2010, Arne Jansen wrote:

> Especially if the characteristics are different, I find it a good
> idea to mix all of them on one set of spindles. This way you have lots of
> spindles for fast access and lots of space for the sake of space. If
> you divide the available spindles into two sets, you will have much
> fewer spindles available for the responsiveness goal. I don't think
> taking them into a mirror can compensate for that.

This is something that I can agree with. The total number of vdevs in the pool is what primarily determines its responsiveness. While using the same number of devices, splitting the pool might not result in more vdevs in either pool. Mirrors do double the number of readable devices, but the side selected to read is random, so the actual read-performance improvement is perhaps on the order of 50% rather than 100%. Raidz does steal IOPS, so smaller raidz vdevs will help and result in more vdevs in the pool.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
----- Original Message -----

> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
> >
> > Close to 1TB of SSD cache will also help boost read
> > speeds,
>
> Remember, this will not boost large sequential reads. (It could possibly
> even hurt them.) This will only boost random reads.

As far as I can see, we mostly have random reads and not too much large sequential I/O, so this is what I'm looking for.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
On 22/06/10 01:05 AM, Fredrich Maney wrote:

> On Mon, Jun 21, 2010 at 8:59 AM, James C. McPherson
> <jmcp at opensolaris.org> wrote:
> [...]
>> So when I'm
>> trying to figure out who I need to yell at because they're
>> using more than our acceptable limit (30GB), I have to run
>> "du -s /builds/[zyx]". And that takes time. Lots of time.
> [...]
>
> Why not just use quotas?

Quotas are not always appropriate. Also, given our usage model, and wanting to provide a service that gatelings can use to work on multiple changesets concurrently, we figure that telling people "your limit is X GB, we will publicly shame you if you exceed it, and then go and remove old stuff for you" is sufficiently hands-off. We're adults here, not children or kiddies with no regard for our fellow engineers.

James C. McPherson
--
Senior Software Engineer, Solaris
Oracle
http://www.jmcp.homeunix.com/blog
On Sun, 20 Jun 2010, Arne Jansen wrote:

> In my experience the boot time mainly depends on the number of datasets,
> not the number of snapshots. 200 datasets is fairly easy (we have >7000,
> but did some boot-time tuning).

What kind of boot tuning are you referring to? We've got about 8k filesystems on an x4500; it takes about 2 hours for a full boot cycle, which is kind of annoying. The majority of that time is taken up with NFS sharing, which currently scales very poorly :(.

Thanks...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Paul B. Henson wrote:

> On Sun, 20 Jun 2010, Arne Jansen wrote:
>
>> In my experience the boot time mainly depends on the number of datasets,
>> not the number of snapshots. 200 datasets is fairly easy (we have >7000,
>> but did some boot-time tuning).
>
> What kind of boot tuning are you referring to? We've got about 8k
> filesystems on an x4500; it takes about 2 hours for a full boot cycle,
> which is kind of annoying. The majority of that time is taken up with NFS
> sharing, which currently scales very poorly :(.

As you said, most of the time is spent on NFS sharing, but mounting also isn't as fast as it could be. We found that the zfs utility is very inefficient, as it does a lot of unnecessary and costly checks. We set mountpoint to legacy and handle mounting/sharing ourselves in a massively parallel fashion (50 processes). Using the system utilities makes things a lot better, but you can speed up sharing a lot more by setting the SHARE_NOINUSE_CHECK environment variable before invoking share(1M). With this you should be able to share your tree in about 10 seconds.

Good luck,
Arne
Arne Jansen wrote:

> Using the system utilities makes things a lot better, but you can speed up
> sharing a lot more by setting the SHARE_NOINUSE_CHECK environment variable
> before invoking share(1M). With this you should be able to share your tree
> in about 10 seconds.

I forgot the disclaimer: you can crash your machine if you call share with improper arguments while this flag is set. IIRC it skips the check of whether the fs is already shared, so it cannot handle a re-share properly.

--Arne
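A sketch of what such a parallel legacy-mount/share pass could look like, with the caveat above in mind. The pool name (tank/home), mountpoint layout, share options and the 50-way split are all assumptions, and this is untested:

    #!/bin/ksh
    # Mount and share all datasets under tank/home (mountpoint=legacy),
    # split across roughly 50 parallel workers.
    export SHARE_NOINUSE_CHECK=1

    zfs list -H -o name -r tank/home > /tmp/fslist
    lines=$(( $(wc -l < /tmp/fslist) / 50 + 1 ))
    split -l $lines /tmp/fslist /tmp/fschunk.

    for chunk in /tmp/fschunk.*; do
        (
            while read fs; do
                mp="/export/${fs#tank/}"
                mkdir -p "$mp"
                mount -F zfs "$fs" "$mp" && share -F nfs -o rw,nosuid "$mp"
            done < "$chunk"
        ) &
    done
    wait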
On Tue, 22 Jun 2010, Arne Jansen wrote:

> We found that the zfs utility is very inefficient, as it does a lot of
> unnecessary and costly checks.

Hmm, presumably somebody at Sun doesn't agree with that assessment, or you'd think they'd take them out :).

Mounting/sharing by hand outside of the zfs framework does make a huge difference. It takes about 45 minutes to mount/share or unshare/unmount with the mountpoint and sharenfs zfs properties set; mounting/sharing by hand with SHARE_NOINUSE_CHECK=1, even just sequentially, only took about 2 minutes. With some parallelization I could definitely see hitting that 10 seconds you mentioned, which would sure make my patch windows a hell of a lot shorter. I'll need to put together a script and fiddle some with smf, joy oh joy; I need these filesystems mounted before the web server starts.

Thanks much for the tip!

I'm hoping someday they'll clean up the sharing implementation and make it a bit more scalable. I had a ticket open once and they pretty much said it would never happen for Solaris 10, but maybe sometime in the indefinite future for OpenSolaris...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Edward Ned Harvey <solaris2 at nedharvey.com> writes:

> There are legitimate specific reasons to use separate filesystems
> in some circumstances. But if you can't name one reason why it's
> better ... then it's not better for you.

Having separate filesystems per user lets you create user-specific quotas and reservations, lets you allow users to make their own snapshots, and lets you do zfs send/recv replication of single user home directories (for backup or a move to another pool), and even allows the users to do that on their own.

--
Usenet is not a right. It is a right, a left, a jab, and a sharp uppercut to the jaw. The postman hits! You have new mail.
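For example (the dataset, user and host names here are hypothetical):

    # per-user quota and reservation
    zfs set quota=30G tank/home/alice
    zfs set reservation=5G tank/home/alice

    # delegate snapshot handling to the user herself
    zfs allow alice snapshot,destroy,mount tank/home/alice

    # replicate a single user's home directory elsewhere
    zfs snapshot tank/home/alice@migrate
    zfs send tank/home/alice@migrate | ssh otherhost zfs recv backup/home/alice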