Greetings, all. Does anyone have a good whitepaper or three on how ZFS uses memory and swap? I did some Googling, but found nothing useful.

The reason I ask is that we have a small issue with some of our DBAs. We have a server with 16GB of memory, and they are looking at moving databases over to it from a smaller system. The catch is that they are moving to 10g, and Oracle suggests a 2GB SGA. They are using top to determine how many databases they can fit on the server (I know, I know... not the right tool) based on top's reporting of memory and swap usage. As a test, they fired up a single database with a 10GB SGA "to simulate five 2GB databases" while running top. The system could not de-allocate the memory from the ZFS (and any other) cache, re-allocate it to the database, and start the database, all in under two minutes. A few moments later (we didn't get any times on their top screenshots), the 10GB DB was able to start.

What I'm basically looking for is information, and perhaps the best use of vmstat, et al., to show them that the server can indeed handle several databases started in a realistic manner, and to explain to them how memory is used by ZFS and released when other applications require it. I know this is a bit of a strange one, but these DBAs seem to insist that everything work the same as it did using UFS under Solaris 8 (and that top is the Holder of the Truth(tm)), and we need to prove to them, using well-reasoned arguments, that the changes to memory management in Solaris 10 and ZFS do not stop their databases from running properly. The fact that we have other databases running quite happily on other systems using ZFS, including 12 really big databases on a 32GB V880, seems to be irrelevant.

Thank you all for any help you can provide.

Rainer
Overall ZFS/database blogs:

http://blogs.sun.com/realneel/entry/zfs_and_databases
http://blogs.sun.com/roch/entry/zfs_and_oltp

Memory:

http://www.opensolaris.org/jive/thread.jspa?messageID=82353

(There are more postings on RAM; just search the forum.)
Thanks for the links, but this is not really the kind of data I'm looking for. These focus more on I/O. I need information on the memory caching, and so on. Specifically, I need data that shows how starting up a 10GB SGA database on a 16GB machine will not be able to flush the ZFS cache as quickly as the DBAs are assuming, and how to point them to more realistic tests/metrics and get them away from top's simplistic viewpoint of memory under S10.

Thanks.
Rainer
Hello Rainer,

Friday, March 16, 2007, 11:11:16 PM, you wrote:

RH> Thanks for the links, but this is not really the kind of data I'm
RH> looking for. These focus more on I/O. I need information on the
RH> memory caching, and so on. Specifically, I need data that shows how
RH> starting up a 10GB SGA database on a 16GB machine will not be able
RH> to flush the ZFS cache as quickly as the DBAs are assuming, and
RH> how to point them to more realistic tests/metrics, and get them
RH> away from top's simplistic viewpoint of memory under S10.

ZFS should give back memory used for cache to the system if applications are demanding it. Right, it should, but sometimes it won't.

However, with databases there's a simple workaround: since you know how much RAM all the databases will consume, you can limit ZFS's ARC cache to the remaining free memory (and possibly reduce it even more, by a 2-3x factor). For details on how to do it, see the 'C'mon ARC, stay small...' thread here.

So if you have 16GB RAM in a system and want 10GB for the SGA + another 2GB for Oracle + 1GB for other kernel resources, you are left with 3GB. So I would limit the ARC's c_max to 3GB, or even to 1GB.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Thanks for the feedback. Please see below.

> ZFS should give back memory used for cache to the system if
> applications are demanding it. Right, it should, but sometimes it
> won't.
>
> However, with databases there's a simple workaround: since you know
> how much RAM all the databases will consume, you can limit ZFS's ARC
> cache to the remaining free memory (and possibly reduce it even more,
> by a 2-3x factor). For details on how to do it, see the 'C'mon ARC,
> stay small...' thread here.
>
> So if you have 16GB RAM in a system and want 10GB for the SGA +
> another 2GB for Oracle + 1GB for other kernel resources, you are left
> with 3GB.
>
> So I would limit the ARC's c_max to 3GB, or even to 1GB.

I was of the understanding that this kernel setting was only introduced in newer Nevada builds. Does this actually work under Solaris 10, Update 3?

Thanks again.
Rainer
Info on tuning the ARC was just recently updated:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Memory_and_Dynamic_Reconfiguration_Recommendations

-r

Rainer Heilke writes:
> Thanks for the feedback. Please see below.
>
> [...]
>
> I was of the understanding that this kernel setting was only introduced
> in newer Nevada builds. Does this actually work under Solaris 10, Update 3?
>
> Thanks again.
> Rainer
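[For reference: on releases where the tunable exists (Nevada at the time of this thread, and later Solaris 10 updates), the guide caps the ARC with a line in /etc/system. A minimal sketch, using the 3GB figure worked out earlier in the thread; the value is an example, not a recommendation:

    * Cap the ZFS ARC at 3 GB (0xC0000000 bytes); takes effect on the next boot.
    set zfs:zfs_arc_max = 0xc0000000
]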
Hello Rainer,

Monday, March 19, 2007, 3:07:54 AM, you wrote:

RH> Thanks for the feedback. Please see below.

[...]

RH> I was of the understanding that this kernel setting was only
RH> introduced in newer Nevada builds. Does this actually work under
RH> Solaris 10, Update 3?

Yes.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
The updated information states that the kernel setting is only for the current Nevada build. We are not going to use the kernel debugger method to change the setting on a live production system (and do this every time we need to reboot).

We're back to trying to set their expectations more realistically, and to using proper tools to measure memory usage. As I stated at the outset, they are trying to start up a 10GB SGA database within two minutes to "simulate" the start-up of five 2GB databases at boot-up. I sincerely doubt they are going to start all five databases simultaneously within two minutes on a regular boot-up.

So, what is the best use of the OS tools (vmstat, etc.) to show them how this would really occur?

Rainer
Rainer Heilke writes:
> The updated information states that the kernel setting is only for the
> current Nevada build. We are not going to use the kernel debugger
> method to change the setting on a live production system (and do this
> every time we need to reboot).
>
> We're back to trying to set their expectations more realistically, and
> to using proper tools to measure memory usage. As I stated at the outset,
> they are trying to start up a 10GB SGA database within two minutes to
> "simulate" the start-up of five 2GB databases at boot-up. I sincerely
> doubt they are going to start all five databases simultaneously within
> two minutes on a regular boot-up.

After bootup, ZFS should have near zero memory in the ARC. Limiting the ARC should have no effect on their startup times. Right?

-r

> So, what is the best use of the OS tools (vmstat, etc.) to show them
> how this would really occur?
>
> Rainer
I currently run 6 Oracle 9i and 10g DBs using an 8GB SGA apiece in containers on a V890 and find no difficulties starting Oracle (though we don't start all the DBs truly simultaneously). The ARC cache doesn't ramp up until a lot of I/O has passed through after a reboot (typically a steady rise over 4-8 hours). I use the mdb -k method at startup, before the containers come up and the ZFS datasets are available, with mixed success in holding down kernel ARC memory over time, but with a definite extension of free memory.

Setting ARC cache limits helps for a while after a system has been up and is slowly gaining cache. Generally speaking, though, Oracle runs well even with low free memory and a high ARC cache, and the ARC does dump memory reasonably well (though with added overhead) in S10U3, allowing me to start a single instance of Oracle with an 8GB SGA on a machine showing only 4GB free (according to top). And then another. And then another.
Robert Milkowski
2007-Mar-19 16:33 UTC
[zfs-discuss] Re: Re: Re: ZFS memory and swap usage
Hello Rainer,

Monday, March 19, 2007, 4:50:59 PM, you wrote:

RH> The updated information states that the kernel setting is only
RH> for the current Nevada build. We are not going to use the kernel
RH> debugger method to change the setting on a live production system
RH> (and do this every time we need to reboot).

All you need is to run a small script at boot that sets it up using mdb.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
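[A minimal sketch of the mdb approach described in the 'C'mon ARC, stay small...' thread. The address and offset below are examples only; they differ on every system and build, and writing kernel memory this way is unsupported, so treat the transcript as illustrative:

    # mdb -kw
    > arc::print -a c_max
    ffffffffc00b3260 c_max = 0x3c0000000
    > ffffffffc00b3260/Z 0xc0000000
    arc+0x20:       0x3c0000000     =       0xc0000000
    > $q

The write does not survive a reboot, which is why it is usually wrapped in a small init script that runs early in boot; the same thread also suggests lowering arc.c in the same way so the target shrinks immediately.]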
Rainer Heilke
2007-Mar-19 16:46 UTC
[zfs-discuss] Re: Re: Re: Re: ZFS memory and swap usage
> After bootup, ZFS should have near zero memory in the
> ARC.

This makes sense, and I have no idea how long the server had been running before the test. We can use the above information to help manage their expectations; on boot-up, the ARC will be low, so the de-allocation of resources won't be an issue.

> Limiting the ARC should have no effect on their
> startup times. Right?

I would think so as well.

Thanks.
Rainer
Thanks. As above, knowing that the ARC takes time to ramp up strongly suggests it won't be an issue on a normally booting system. It sounds like your needs are much greater, and that your databases are running fine. I can take this information to the DBAs and use it to "manage their expectations". I think the hard part will be getting them to accept that the ARC will shrink to accommodate Oracle's needs, but may not shrink fast enough to accommodate starting up a database five times larger than anything we would have on this server in under 120 seconds, after the server has been running for a few days or weeks.

Thanks. :-)
Rainer
Jason J. W. Williams
2007-Mar-19 17:02 UTC
[zfs-discuss] Re: Re: Re: ZFS memory and swap usage
Hi Rainer,

While I would recommend upgrading to Build 54 or newer to use the system tunable, it's not that big a deal to set the ARC on boot-up. We did it on a T2000 for a while, until we could take it down for an extended period of time to upgrade it. I definitely WOULD NOT run a database on ZFS without it. You will run out of RAM, and depending on how your DB responds to being out of RAM, you could get some very undesirable results.

Just my two cents.

-J

On 3/19/07, Rainer Heilke <rheilke at dragonhearth.com> wrote:
> The updated information states that the kernel setting is only for the
> current Nevada build. We are not going to use the kernel debugger method
> to change the setting on a live production system (and do this every
> time we need to reboot).
>
> We're back to trying to set their expectations more realistically, and
> to using proper tools to measure memory usage. As I stated at the outset,
> they are trying to start up a 10GB SGA database within two minutes to
> "simulate" the start-up of five 2GB databases at boot-up. I sincerely
> doubt they are going to start all five databases simultaneously within
> two minutes on a regular boot-up.
>
> So, what is the best use of the OS tools (vmstat, etc.) to show them
> how this would really occur?
>
> Rainer
On 3/19/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Hello Rainer,
>
> Monday, March 19, 2007, 4:50:59 PM, you wrote:
>
> RH> The updated information states that the kernel setting is only
> RH> for the current Nevada build. We are not going to use the kernel
> RH> debugger method to change the setting on a live production system
> RH> (and do this every time we need to reboot).
>
> All you need is to run a small script at boot that sets it up using mdb.

I have come up with the following dtrace script. I'm not sure that it is gathering exactly the values that I (or you) would want; I developed it mainly as a test to see how much I really wanted to push an RFE to get ARC stats into kstat.

#! /usr/sbin/dtrace -Cs

#pragma D option quiet

BEGIN
{
        printf("   MIN    MAX    CUR TARGET\n");
        printf("------ ------ ------ ------\n");
}

profile:::tick-10s
{
        printf("%6d %6d %6d %6d\n",
            (long long) zfs`arc.c_min / 1024 / 1024,
            (long long) zfs`arc.c_max / 1024 / 1024,
            (long long) zfs`arc.size / 1024 / 1024,
            (long long) zfs`arc.c / 1024 / 1024);
}

--
Mike Gerdts
http://mgerdts.blogspot.com/
Hi Mike,

This is already integrated in Nevada:

6510807 ARC statistics should be exported via kstat

# kstat zfs:0:arcstats
module: zfs                             instance: 0
name:   arcstats                        class:    misc
        c                               534457344
        c_max                           16028893184
        c_min                           534457344
        crtime                          6301.4284957
        deleted                         1149800
        demand_data_hits                4514722
        demand_data_misses              54810
        demand_metadata_hits            289342
        demand_metadata_misses          5203
        evict_skip                      0
        hash_chain_max                  8
        hash_chains                     8192
        hash_collisions                 1243605
        hash_elements                   53250
        hash_elements_max               250443
        hits                            9929297
        mfu_ghost_hits                  3917
        mfu_hits                        2496914
        misses                          60013
        mru_ghost_hits                  29072
        mru_hits                        2596064
        mutex_miss                      4791
        p                               210483584
        prefetch_data_hits              5125227
        prefetch_data_misses            0
        prefetch_metadata_hits          6
        prefetch_metadata_misses        0
        recycle_miss                    2338
        size                            439890944
        snaptime                        939404.5920782

-r
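[On builds with that kstat, an individual statistic can also be sampled periodically from the shell, no dtrace needed. For example, with a 10-second interval (the second value shown is illustrative):

    # kstat -p zfs:0:arcstats:size 10
    zfs:0:arcstats:size     439890944
    zfs:0:arcstats:size     440012800
]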
Rainer Heilke
2007-Mar-20 15:48 UTC
[zfs-discuss] Re: Re: Re: Re: ZFS memory and swap usage
Thanks, I'll give it a whirl.

Rainer
Rainer Heilke
2007-Mar-20 15:54 UTC
[zfs-discuss] Re: Re: Re: Re: ZFS memory and swap usage
We cannot go to an OpenSolaris Nevada build for political as well as support reasons; it's not an option.

We have been running several other systems using Oracle on ZFS without issues. The current problem is more about getting the DBAs to understand how things have changed with Sol10/ZFS. I was hoping for whitepapers or other data on how ZFS (and Solaris 10) use memory, and so on. Going back to my original posts, we see that the 10GB SGA database does start, just not in the 120 seconds they are expecting. I want to explain why, in clear terms the DBAs will understand.

Rainer
Did you say which version of Solaris 10 you are using? I had similar problems on Sol10 U2 when booting a database. This involved first initializing the data files (a few GB), then starting the server(s), which tried to allocate a large chunk of shared memory. This failed miserably, since ZFS had gobbled up most memory. But with Sol10 U3 this problem had gone away, and ARC memory was quickly released. I found vmstat convenient for a simple overview of available memory during the process.

- Bjorn
We're running Update 3. Note that the DB _does_ come up, just not in the two minutes they were expecting. If they wait a few moments after their two-minute start-up attempt, it comes up just fine.

I was looking at vmstat, and it seems to tell me what I need. It's just that I need to present the data as simply as possible for the DBAs.

Rainer
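[A minimal demonstration along these lines, assuming a Solaris 10 system of this vintage: watch free memory while the database starts, then compare the kernel's share of RAM before and after. On these releases the ARC is counted in the "Kernel" bucket of ::memstat, so a shrinking Kernel figure alongside growing Anon pages is the ARC giving memory back to the SGA. The exact column layout varies by release:

    # vmstat 5
    # echo ::memstat | mdb -k
]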
On Wed, 21 Mar 2007, Rainer Heilke wrote:
[... reformatted ...]

> We're running Update 3. Note that the DB _does_ come up, just not in the
> two minutes they were expecting. If they wait a few moments after their
> two-minute start-up attempt, it comes up just fine.

So why don't you state the actual time it takes to "come up"? What is "magic" about 2 minutes?

> I was looking at vmstat, and it seems to tell me what I need. It's just
> that I need to present the data as simply as possible for the DBAs.
>
> Rainer

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
> So why don't you state the actual time it takes to "come up"?

I can't, because I don't know. The DBAs have been very difficult about sharing the information. It took several emails and a meeting before we even found out that the 10GB SGA DB didn't start up "quickly enough". We also honestly don't know how long they waited before the next attempt, where the database came up successfully.

> What is "magic" about 2 minutes?

This could be a number they picked out of the air, but it may actually be a legitimate timeout Oracle uses to determine whether it is coming up cleanly or not; I have no idea.

Since I provided a summary of the information I got here, together with what I already knew, my offer to sit down with them and redo the testing has generated a big silence. As I stated before, I think this is mostly an exercise in training and managing expectations, not a real "problem". I've been far too busy at work to push the matter past their current silence. (This account is my personal involvement with OpenSolaris, and I'm answering a lot of these responses from home.)

I posted my original question at the request of my Team Lead, in the hopes of filling in and clarifying some of the information I've read. The comments in this thread have been helpful in doing that. Thanks again to everyone.

Rainer