Hello,

Is there a way to set a maximum memory utilization for ZFS? We're trying to debug an issue where ZFS is sucking all the RAM out of the box, and we think it's crashing MySQL as a result. Will ZFS reduce its cache size if it feels memory pressure? Any help is greatly appreciated.

Best Regards,
Jason
Jason,

There is no documented way of limiting the memory consumption. The ARC section of ZFS tries to adapt to the memory pressure on the system; however, in your case it is probably not adapting quickly enough.

One way of limiting the memory consumption would be to limit arc.c_max. This value (arc.c_max) is set to 3/4 of the available memory (or 1GB less than the available memory) when ZFS is loaded, in arc_init().

You should be able to change the value of arc.c_max through mdb and set it to the value you want. Exercise caution while setting it, and make sure you don't have active zpools during this operation.

Thanks and regards,
Sanjeev.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Hi Sanjeev,

Thank you very much! I'm not very familiar with using mdb. Is there anything to be aware of besides having no active zpools?

Also, which takes precedence, 3/4 of the memory or 1GB less than memory? Thank you in advance! Your help is greatly appreciated.

Best Regards,
Jason
Sanjeev,

Could you point me in the right direction as to how to convert the following GCC compile flags to Studio 11 compile flags? Any help is greatly appreciated. We're trying to recompile MySQL to produce a stack trace and core file to track down exactly why it's crashing... hopefully it will illuminate whether memory truly is the issue. Thank you very much in advance!

-felide-constructors
-fno-exceptions
-fno-rtti

Best Regards,
Jason
On 8-Jan-07, at 11:54 AM, Jason J. W. Williams wrote:
> ...We're trying to recompile MySQL to give a stacktrace and core file to track down exactly why it's crashing... hopefully it will illuminate if memory truly is the issue.

If you're using the Enterprise release, can't you get MySQL's assistance with this?

--Toby
We're not using the Enterprise release, but we are working with them. It looks like MySQL is crashing due to lack of memory.

-J

On 1/8/07, Toby Thain <toby at smartgames.ca> wrote:
> If you're using the Enterprise release, can't you get MySQL's assistance with this?
Jason,

Jason J. W. Williams wrote:
> Thank you very much! I'm not very familiar with using mdb. Is there anything to be aware of besides no active zpools?

You can do the following as the root user:

-- snip --
# mdb -kw
> arc::print -a "struct arc" c_max
ffffffffc009a538 c_max = 0x2f9aa800
> ffffffffc009a538/W 0x20000000
arc+0x48:       0x2f9aa800      =       0x20000000
>
-- snip --

Here I have modified the value of c_max from 0x2f9aa800 to 0x20000000.

> Also, which takes precedence, 3/4 of the memory or 1GB? Thank you in advance! Your help is greatly appreciated.

Whichever is higher. Look at the routine arc_init() [Line 2544] at:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

And I think your request has been answered :-) Looking at the source, I see that they have introduced two new variables, zfs_arc_max and zfs_arc_min, which seem to be tunables!

There is a detailed explanation at:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6505658

Thanks and regards,
Sanjeev.
Hi Sanjeev,

Thank you! I was not able to find anything as useful on the subject as that! We are running build 54 on an X4500. Would I be correct in my reading of that article that if I put "set zfs:zfs_arc_max 0x100000000" (4GB) in my /etc/system, ZFS will consume no more than 4GB? Thank you in advance.

Best Regards,
Jason
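[Editor's note: for reference, the exact /etc/system syntax matters. Comments in that file start with '*' rather than '#', and the assignment is normally written with an '=' sign. A minimal sketch of the entry being discussed (same 4GB value; it takes effect at the next reboot and does not change a running system):

    * Cap the ZFS ARC at 4 GB (0x100000000 bytes).
    set zfs:zfs_arc_max = 0x100000000

On a live system the equivalent change is the arc.c_max patch via mdb -kw shown earlier in the thread.]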
Jason,

Apologies... I missed this mail yesterday.

I am not too familiar with the options; someone else will have to answer this.

Thanks and regards,
Sanjeev.
Hi Jason,

Depending on which hardware architecture you're working on, you may be able to get Studio 11 compiled binaries through the CoolStack project:
http://cooltools.sunsource.net/coolstack/index.html

Regardless, optimized compiler flags for MySQL with Studio 11 are in the source bundle listed there. If you need additional help, let me know.

Separately, it's my understanding that ZFS reduces its memory usage as Solaris needs to allocate more memory for applications. I've not seen this problem, but I suspect it would be better to try to come up with a simpler test case that mimics MySQL (i.e. mmap() or malloc through whichever memory library MySQL is using). I suspect it's mmap or DISM, but if it's a *alloc problem, you may want to look at the man page for umem_debug, as that may help you find where the problem is coming from.

In fact, CoolStack may be a good, tested, stable build for you to use alongside ZFS. You can email me directly with any issues you run into with it and I'll get it into the right group of people.

Hope that helps,

- Matt

--
Matt Ingenthron - Web Infrastructure Solutions Architect
Sun Microsystems, Inc. - Client Solutions, Systems Practice
http://blogs.sun.com/mingenthron/
email: matt.ingenthron at sun.com
Phone: 310-242-6439
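[Editor's note: nobody in the thread gives the direct flag mapping Jason asked for. As an assumption on my part (not confirmed in the thread), Sun Studio 11 CC controls the same behaviour as the last two GCC flags through its -features option, while -felide-constructors has no single-flag counterpart that I know of. A sketch, with the source file name only a placeholder:

    # GCC flags used in the MySQL build:
    #   -felide-constructors -fno-exceptions -fno-rtti
    # Assumed Sun Studio 11 equivalents for the last two:
    CC -features=no%except -features=no%rtti -g -c some_file.cc
]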
Hello Jason,

Tuesday, January 9, 2007, 10:28:12 PM, you wrote:

JJWW> We are running build 54 on an X4500. Would I be correct in my reading of that article that if I put "set zfs:zfs_arc_max 0x100000000" (4GB) in my /etc/system, ZFS will consume no more than 4GB?

That's the idea; however, it's not working that way right now - under some circumstances ZFS can still consume much more memory. See other recent posts here.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Jason,

Robert is right.

The point is that the ARC is the caching module of ZFS, and the majority of the memory is consumed through the ARC. Hence, by limiting the c_max of the ARC we are limiting the amount the ARC consumes. However, other modules of ZFS will consume memory as well, though that may not be as significant as the ARC.

Experts, please correct me if I am wrong here.

Thanks and regards,
Sanjeev.
Sanjeev & Robert,

Thanks guys. We put that in place last night and it seems to be doing a much better job of consuming less RAM. We set it to 4GB and each of our 2 MySQL instances on the box to a max of 4GB, so hopefully a slush of 4GB on the Thumper is enough. I would be interested in what the other ZFS modules' memory behaviors are; I'll take a perusal through the archives. In general it seems to me that a max cap for ZFS, whether set through a series of individual tunables or a single root tunable, would be very helpful.

Best Regards,
Jason
Hi Guys,

After reading through the discussion regarding ZFS memory fragmentation on snv_53 (and forward) and going through our ::kmastat, it looks like ZFS is holding about 544 MB of RAM in the various caches. About 360MB of that is in the zio_buf_65536 cache; the next most notable are 55MB in zio_buf_32768 and 36MB in zio_buf_16384. I don't think that's too bad, but it is worth keeping track of. At this point our kernel memory growth seems to have slowed, with it hovering around 5GB, and the anon column is mostly what's growing now (as expected... MySQL).

Most of the problem in the discussion thread on this seemed to be related to a lot of DNLC entries due to the workload of a file server. How would this affect a database server with operations in only a couple of very large files? Thank you in advance.

Best Regards,
Jason
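[Editor's note: for anyone wanting to reproduce this kind of breakdown, the figures above come from the standard kernel-memory observability tools. A minimal sketch (the grep pattern is only illustrative):

    # Overall split of kernel / anon / free memory, in pages (run as root).
    echo "::memstat" | mdb -k

    # Per-cache allocator statistics; the zio_buf_* lines show how much
    # memory the ZFS I/O buffer caches are holding.
    echo "::kmastat" | mdb -k | grep zio_buf
]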
Hello Jason,

Wednesday, January 10, 2007, 9:45:05 PM, you wrote:

JJWW> In general it seems to me that a max cap for ZFS, whether set through a series of individual tunables or a single root tunable, would be very helpful.

Yes, it would. Better yet would be if memory consumed by ZFS for caching (dnodes, vnodes, data, ...) behaved similarly to the page cache with UFS, so applications would be able to get back almost all memory used for ZFS caches if needed.

I guess (and it's really a guess, only based on some emails here) that in the worst-case scenario ZFS caches would consume about:

    arc_max + 3*arc_max + memory lost to fragmentation

So I guess with arc_max set to 1GB you could lose even 5GB (or more), and currently only that first 1GB can be reclaimed automatically.

Best regards,
Robert
Hi Robert,

Thank you! Holy mackerel! That's a lot of memory. With that type of calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation? That kind of memory loss makes ZFS almost unusable for a database system.

I agree that a page cache similar to UFS would be much better. Linux works similarly to free pages, and it has been effective enough in the past, though I'm equally unhappy about Linux's tendency to grab every bit of free RAM available for filesystem caching and then cause massive memory thrashing as it frees it for applications.

Best Regards,
Jason
On 2007-Jan-10 23:45 UTC, johansen-osdev at sun.com wrote (Re: [zfs-discuss] Limit ZFS Memory Utilization):

Robert:
> Better yet would be if memory consumed by ZFS for caching (dnodes, vnodes, data, ...) behaved similarly to the page cache with UFS, so applications would be able to get back almost all memory used for ZFS caches if needed.

I believe that a better response to memory pressure is a long-term goal for ZFS. There's also an effort in progress to improve the caching algorithms used in the ARC.

-j
Hello Jason,

Thursday, January 11, 2007, 12:36:46 AM, you wrote:

JJWW> That kind of memory loss makes ZFS almost unusable for a database system.

If you leave ncsize at its default value then I believe it won't consume that much memory.

JJWW> I agree that a page cache similar to UFS would be much better.

A page cache won't be better - just better memory control for the ZFS caches is strongly desired. Unfortunately, from time to time ZFS makes servers page enormously :(

Best regards,
Robert
Hi Robert,

We've got the default ncsize. I didn't see any advantage to increasing it outside of NFS serving... which this server is not. For speed, the X4500 is proving to be a killer MySQL platform. Between the blazing fast procs and the sheer number of spindles, its performance is tremendous. If MySQL Cluster had full disk-based support, scale-out with X4500s a-la Greenplum would be a terrific solution.

At this point, the ZFS memory gobbling is the main roadblock to it being a good database platform.

Regarding the paging activity, we too saw tremendous paging, with up to 24% of the X4500's CPU being used for that with the default arc_max. After changing it to 4GB, we haven't seen anything much over 5-10%.

Best Regards,
Jason
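[Editor's note: to confirm this kind of improvement on your own system, a couple of stock Solaris checks (not specific to this thread) look roughly like this:

    # Sample memory pressure every 5 seconds; a persistently non-zero "sr"
    # (page scan rate) column means the page scanner is still hunting
    # for memory.
    vmstat 5

    # Swap-space summary, to see whether anything has actually been
    # pushed out.
    swap -s
]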
Hello Jason,

Thursday, January 11, 2007, 1:10:10 AM, you wrote:

JJWW> For speed, the X4500 is proving to be a killer MySQL platform. Between the blazing fast procs and the sheer number of spindles, its performance is tremendous.

Have you got any numbers you can share?

Best regards,
Robert
Jason J. W. Williams wrote:
> Thank you! Holy mackerel! That's a lot of memory. With that type of calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation?

In a worst-case scenario, Robert's calculations are accurate to a certain degree: if you have 1GB of dnode_phys data in your arc cache (that would be about 1,200,000 files referenced), then this will result in another 3GB of "related" data held in memory: vnodes/znodes/dnodes/etc. This related data is the in-core data associated with an accessed file. It's not quite true that this data is not evictable; it *is* evictable, but the space is returned from these kmem caches only after the arc has cleared its blocks and triggered the "free" of the related data structures (and even then, the kernel will need to do a kmem_reap to reclaim the memory from the caches). The fragmentation that Robert mentions is an issue because, if we don't free everything, the kmem_reap may not be able to reclaim all the memory from these caches, as they are allocated in "slabs".

We are in the process of trying to improve this situation.

> That kind of memory loss makes ZFS almost unusable for a database system.

Note that you are not going to experience these sorts of overheads unless you are accessing *many* files. In a database system, there are only going to be a few files => no significant overhead.

> I agree that a page cache similar to UFS would be much better.

The page cache is "much better" in the respect that it is more tightly integrated with the VM system, so you get a more efficient response to memory pressure. It is *much worse* than the ARC at caching data for a file system. In the long term we plan to integrate the ARC into the Solaris VM system.
Hey guys,

Due to long URL lookups, the DNLC was pushed to variable-sized entries. The hit rate was dropping because of "name too long" misses. This was done long ago, while I was at Sun, under a bug reported by me.

I don't know your usage, but you should attempt to estimate the amount of memory used with the default size. Yes, this is after you start tracking your DNLC hit rate; make sure it doesn't significantly drop if ncsize is decreased. You may also wish to increase the size and again check the hit rate. Yes, it is possible that your access is random enough that no changes will affect the hit rate.

2nd item: Bonwick's memory allocators, I think, still have the ability to limit the size of each slab. The issue is that some parts of the code expect non-failing memory allocations with SLEEPs. This can result in extended SLEEPs, but it can be done. If your company generates changes to your local source and then rebuilds, it is possible to pre-allocate a fixed number of objects per cache and then use NOSLEEPs with return values that indicate retry or failure.

3rd, and this could be the most important: the memory cache allocators are lazy in freeing memory when it is not needed by anyone else. Thus, unfreed memory is effectively used as a cache to remove the latencies of on-demand memory allocations. This artificially keeps memory usage high, but should have minimal latencies to realloc when necessary. Also, it is possible to make mods to increase the level of memory garbage collection, after some watermark code is added, to minimize repeated allocs and frees.

Mitchell Erblich
----------------
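[Editor's note: tracking the DNLC hit rate as Mitchell suggests can be done with the stock tools; a minimal sketch (standard Solaris commands, not from the thread):

    # Cumulative name-lookup statistics; the "cache hits" percentage is
    # the DNLC hit rate since boot.
    vmstat -s | grep 'name lookups'

    # More detailed DNLC counters (hits, misses, too-long names, purges).
    kstat -n dnlcstats

    # Current ncsize setting on a live system.
    echo 'ncsize/D' | mdb -k
]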
Hi Mark,

Thank you. That makes a lot of sense. In our case we're talking about around 10 multi-gigabyte files. The arc_max + 3*arc_max + fragmentation figure was a bit worrisome. It sounds, then, like this is mostly an issue on something like an NFS server with a ton of small files, where minimum_file_node_overhead * files was consuming the arc_max*3?

On a side note, it appears that most of our zio caches stay pretty static. 99% of the ::memstat kernel memory increases are in zio_buf_65536. It seems to increase by between 5-50MB/hr depending on the database update load.

Is integrating the ARC into the Solaris VM system a Solaris Nevada goal? Or would that be the next major release after Nevada?

Best Regards,
Jason
On Wed, 10 Jan 2007, Mark Maybee wrote:
> We are in the process of trying to improve this situation.
.... snip .....

Understood (and many thanks). In the meantime, is there a rule of thumb that you could share that would allow mere humans (like me) to calculate the best values of zfs:zfs_arc_max and ncsize, given that the machine has n GB of RAM and is used in the following broad workload scenarios:

a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single-user Gnome desktop running U3 with home dirs on a ZFS filesystem

It would seem, from reading between the lines of previous emails, particularly the ones you've (Mark M) written, that there is a rule of thumb that would apply given a standard or modified ncsize tunable?

I'm primarily interested in a calculation that would allow settings that would reduce the possibility of the machine descending into "swap hell".

PS: Interestingly, no one has mentioned the tunable maxpgio. I've often found that increasing maxpgio is the only way to improve the odds of a machine remaining usable when lots of swapping is taking place.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Hello all,

I second Al's motion. Even a little script a-la the CoolTools for tuning Solaris for the T2000 would be great.

-J
Robert,

Comments inline...

Robert Milkowski wrote:
> I guess (and it's really a guess, only based on some emails here) that in the worst-case scenario ZFS caches would consume about:
>
>     arc_max + 3*arc_max + memory lost to fragmentation

This is not true, from what I know :-) How did you get to this number? From my knowledge it uses:

    c_max + (some memory for other caches)

NOTE: (some memory for other caches) is not as large as c_max. It is probably just x% of it, and not multiples of c_max.

> So I guess with arc_max set to 1GB you could lose even 5GB (or more), and currently only that first 1GB can be reclaimed automatically.

This doesn't seem right based on my knowledge of ZFS.

Regards,
Sanjeev.
Jason,

Jason J. W. Williams wrote:
> Regarding the paging activity, we too saw tremendous paging, with up to 24% of the X4500's CPU being used for that with the default arc_max. After changing it to 4GB, we haven't seen anything much over 5-10%.

Remember that ZFS does not use the standard Solaris paging architecture for caching. Instead it uses the ARC for all of its caching, and that is the reason tuning the ARC should help in your case. The zio_bufs that you referred to in the previous mail are the caches used by the ARC for caching various things (including the metadata and the data).

Thanks and regards,
Sanjeev.
Al Hopper wrote:> On Wed, 10 Jan 2007, Mark Maybee wrote: > >> Jason J. W. Williams wrote: >>> Hi Robert, >>> >>> Thank you! Holy mackerel! That''s a lot of memory. With that type of a >>> calculation my 4GB arc_max setting is still in the danger zone on a >>> Thumper. I wonder if any of the ZFS developers could shed some light >>> on the calculation? >>> >> In a worst-case scenario, Robert''s calculations are accurate to a >> certain degree: If you have 1GB of dnode_phys data in your arc cache >> (that would be about 1,200,000 files referenced), then this will result >> in another 3GB of "related" data held in memory: vnodes/znodes/ >> dnodes/etc. This related data is the in-core data associated with >> an accessed file. Its not quite true that this data is not evictable, >> it *is* evictable, but the space is returned from these kmem caches >> only after the arc has cleared its blocks and triggered the "free" of >> the related data structures (and even then, the kernel will need to >> to a kmem_reap to reclaim the memory from the caches). The >> fragmentation that Robert mentions is an issue because, if we don''t >> free everything, the kmem_reap may not be able to reclaim all the >> memory from these caches, as they are allocated in "slabs". >> >> We are in the process of trying to improve this situation. > .... snip ..... > > Understood (and many Thanks). In the meantime, is there a rule-of-thumb > that you could share that would allow mere humans (like me) to calculate > the best values of zfs:zfs_arc_max and ncsize, given the that machine has > nGb of RAM and is used in the following broad workload scenarios: > > a) a busy NFS server > b) a general multiuser development server > c) a database server > d) an Apache/Tomcat/FTP server > e) a single user Gnome desktop running U3 with home dirs on a ZFS > filesystem > > It would seem, from reading between the lines of previous emails, > particularly the ones you''ve (Mark M) written, that there is a rule of > thumb that would apply given a standard or modified ncsize tunable?? > > I''m primarily interested in a calculation that would allow settings that > would reduce the possibility of the machine descending into "swap hell". >Ideally, there would be no need for any tunables; ZFS would always "do the right thing". This is our grail. In the meantime, I can give some recommendations, but there is no "rule of thumb" that is going to work in all circumstances. ncsize: As I have mentioned previously, there are overheads associated with caching "vnode data" in ZFS. While the physical on-disk data for a znode is only 512bytes, the related in-core cost is significantly higher. Roughly, you can expect that each ZFS vnode held in the DNLC will cost about 3K of kernel memory. So, you need to set ncsize appropriately for how much memory you are willing to devote to it. 500,000 entries is going to cost you 1.5GB of memory. zfs_arc_max: This is the maximum amount of memory you want the ARC to be able to use. Note that the ARC won''t necessarily use this much memory: if other applications need memory, the ARC will shrink to accommodate. Although, also note that the ARC *can''t* shrink if all of its memory is held. For example, data in the DNLC cannot be evicted from the ARC, so this data must first be evicted from the DNLC before the ARC can free up space (this is why it is dangerous to turn off the ARCs ability to evict vnodes from the DNLC). 
Also keep in mind that the ARC size does not account for many in-core data structures used by ZFS (znodes/dnodes/ dbufs/etc). Roughly, for every 1MB of cached file pointers, you can expect another 3MB of memory used outside of the ARC. So, in the example above, where ncsize is 500,000, the ARC is only seeing about 400MB of the 1.5GB consumed. As I have stated previously, we consider this a bug in the current ARC accounting that we will soon fix. This is only an issue in environments where many files are being accessed. If the number of files accessed is relatively low, then the ARC size will be much closer to the actual memory consumed by ZFS. So, in general, you should not really need to "tune" zfs_arc_max. However, in environments where you have specific applications that consume known quantities of memory (e.g. database), it will likely help to set the ARC max size down, to guarantee that the necessary kernel memory is available. There may be other times when it will be beneficial to explicitly set the ARCs maximum size, but at this time I can''t offer any general "rule of thumb". I hope that helps. -Mark
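For completeness, here is a sketch of how the two tunables Mark describes are usually made persistent, via /etc/system rather than a live mdb patch. The values are purely illustrative (a 4GB ARC cap and a modest DNLC), and whether your particular build honours zfs:zfs_arc_max should be verified before relying on it, so treat this as a starting point rather than a recommendation:

    * /etc/system fragment -- illustrative values only
    * Cap the ARC at 4 GB (4 * 1024^3 = 0x100000000 bytes).
    set zfs:zfs_arc_max = 0x100000000
    *
    * Size the DNLC for the memory you can afford: by Mark''s estimate of
    * roughly 3K of in-core overhead per cached ZFS vnode, 500,000 entries
    * costs about 1.5GB, so 100,000 entries is on the order of 300MB.
    set ncsize = 100000

A reboot is required for /etc/system changes to take effect.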
On 11 January, 2007 - Mark Maybee sent me these 4,7K bytes:> >It would seem, from reading between the lines of previous emails, > >particularly the ones you''ve (Mark M) written, that there is a rule of > >thumb that would apply given a standard or modified ncsize tunable?? > > > >I''m primarily interested in a calculation that would allow settings that > >would reduce the possibility of the machine descending into "swap hell". > > > Ideally, there would be no need for any tunables; ZFS would always "do > the right thing". This is our grail. In the meantime, I can give some > recommendations, but there is no "rule of thumb" that is going to work > in all circumstances. > > ncsize: As I have mentioned previously, there are overheads > associated with caching "vnode data" in ZFS. While > the physical on-disk data for a znode is only 512 bytes, > the related in-core cost is significantly higher. > Roughly, you can expect that each ZFS vnode held in > the DNLC will cost about 3K of kernel memory. > > So, you need to set ncsize appropriately for how much > memory you are willing to devote to it. 500,000 entries > is going to cost you 1.5GB of memory.Due to fragmentation, 200k entries can eat over 1.5GB memory too. http://www.acc.umu.se/~stric/tmp/dnlc-plot2.png This is only dnlc-related buffers on a 2GB machine.. the spike at the end caused the machine to more or less stand still.> zfs_arc_max: This is the maximum amount of memory you want the > ARC to be able to use. Note that the ARC won''t > necessarily use this much memory: if other applications > need memory, the ARC will shrink to accommodate. > Although, also note that the ARC *can''t* shrink if all > of its memory is held. For example, data in the DNLC > cannot be evicted from the ARC, so this data must first > be evicted from the DNLC before the ARC can free up > space (this is why it is dangerous to turn off the ARC''s > ability to evict vnodes from the DNLC).I''ve tried that.. didn''t work out too great due to fragmentation.. Left non-kernel with like 4MB to play with.. /Tomas -- Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se
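Before touching ncsize it can help to see where a box currently stands. A quick sketch using standard tools (command names are from contemporary Solaris builds; output formats differ slightly between releases):

    # echo "ncsize/D" | mdb -k
    # kstat -n dnlcstats
    # vmstat -s | grep "name lookups"

The first prints the current ncsize setting, the second dumps the DNLC hit/miss counters, and the third shows the cumulative name-lookup hit rate. Together they give a rough idea of whether a large DNLC is actually buying anything for the memory it (and the ZFS vnodes behind it) pins.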
Hi Mark, That does help tremendously. How does ZFS decide which zio cache to use? I apologize if this has already been addressed somewhere. Best Regards, Jason On 1/11/07, Mark Maybee <Mark.Maybee at sun.com> wrote:> Al Hopper wrote: > > On Wed, 10 Jan 2007, Mark Maybee wrote: > > > >> Jason J. W. Williams wrote: > >>> Hi Robert, > >>> > >>> Thank you! Holy mackerel! That''s a lot of memory. With that type of a > >>> calculation my 4GB arc_max setting is still in the danger zone on a > >>> Thumper. I wonder if any of the ZFS developers could shed some light > >>> on the calculation? > >>> > >> In a worst-case scenario, Robert''s calculations are accurate to a > >> certain degree: If you have 1GB of dnode_phys data in your arc cache > >> (that would be about 1,200,000 files referenced), then this will result > >> in another 3GB of "related" data held in memory: vnodes/znodes/ > >> dnodes/etc. This related data is the in-core data associated with > >> an accessed file. Its not quite true that this data is not evictable, > >> it *is* evictable, but the space is returned from these kmem caches > >> only after the arc has cleared its blocks and triggered the "free" of > >> the related data structures (and even then, the kernel will need to > >> to a kmem_reap to reclaim the memory from the caches). The > >> fragmentation that Robert mentions is an issue because, if we don''t > >> free everything, the kmem_reap may not be able to reclaim all the > >> memory from these caches, as they are allocated in "slabs". > >> > >> We are in the process of trying to improve this situation. > > .... snip ..... > > > > Understood (and many Thanks). In the meantime, is there a rule-of-thumb > > that you could share that would allow mere humans (like me) to calculate > > the best values of zfs:zfs_arc_max and ncsize, given the that machine has > > nGb of RAM and is used in the following broad workload scenarios: > > > > a) a busy NFS server > > b) a general multiuser development server > > c) a database server > > d) an Apache/Tomcat/FTP server > > e) a single user Gnome desktop running U3 with home dirs on a ZFS > > filesystem > > > > It would seem, from reading between the lines of previous emails, > > particularly the ones you''ve (Mark M) written, that there is a rule of > > thumb that would apply given a standard or modified ncsize tunable?? > > > > I''m primarily interested in a calculation that would allow settings that > > would reduce the possibility of the machine descending into "swap hell". > > > Ideally, there would be no need for any tunables; ZFS would always "do > the right thing". This is our grail. In the meantime, I can give some > recommendations, but there is no "rule of thumb" that is going to work > in all circumstances. > > ncsize: As I have mentioned previously, there are overheads > associated with caching "vnode data" in ZFS. While > the physical on-disk data for a znode is only 512bytes, > the related in-core cost is significantly higher. > Roughly, you can expect that each ZFS vnode held in > the DNLC will cost about 3K of kernel memory. > > So, you need to set ncsize appropriately for how much > memory you are willing to devote to it. 500,000 entries > is going to cost you 1.5GB of memory. > > zfs_arc_max: This is the maximum amount of memory you want the > ARC to be able to use. Note that the ARC won''t > necessarily use this much memory: if other applications > need memory, the ARC will shrink to accommodate. 
> Although, also note that the ARC *can''t* shrink if all > of its memory is held. For example, data in the DNLC > cannot be evicted from the ARC, so this data must first > be evicted from the DNLC before the ARC can free up > space (this is why it is dangerous to turn off the ARCs > ability to evict vnodes from the DNLC). > > Also keep in mind that the ARC size does not account for > many in-core data structures used by ZFS (znodes/dnodes/ > dbufs/etc). Roughly, for every 1MB of cached file > pointers, you can expect another 3MB of memory used > outside of the ARC. So, in the example above, where > ncsize is 500,000, the ARC is only seeing about 400MB > of the 1.5GB consumed. As I have stated previously, > we consider this a bug in the current ARC accounting > that we will soon fix. This is only an issue in > environments where many files are being accessed. If > the number of files accessed is relatively low, then > the ARC size will be much closer to the actual memory > consumed by ZFS. > > So, in general, you should not really need to "tune" > zfs_arc_max. However, in environments where you have > specific applications that consume known quantities of > memory (e.g. database), it will likely help to set the > ARC max size down, to guarantee that the necessary > kernel memory is available. There may be other times > when it will be beneficial to explicitly set the ARCs > maximum size, but at this time I can''t offer any general > "rule of thumb". > > I hope that helps. > > -Mark > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
Jason J. W. Williams wrote:> Hi Mark, > > That does help tremendously. How does ZFS decide which zio cache to > use? I apologize if this has already been addressed somewhere. >The ARC caches data blocks in the zio_buf_xxx() cache that matches the block size. For example, dnode data is stored on disk in 16K blocks (32 dnodes/block), so zio_buf_16384() is used for those blocks. Most file data blocks (in large files) are stored in 128K blocks, so zio_buf_131072() is used, etc. -Mark
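If it helps to see these caches on a live system, the kernel memory allocator statistics list them by name, so you can watch which block sizes are actually being cached and how much memory each cache is holding. A rough sketch (the exact column layout of ::kmastat varies by release):

    # echo "::kmastat" | mdb -k | grep zio_buf

Each zio_buf_<size> line shows buffer counts and memory in use for that block size; on a database workload you would expect most of the data to land in the caches matching the filesystem recordsize in use.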
If I understand correctly, at least some systems claim not to guarantee consistency between changes to a file via write(2) and changes via mmap(2). But historically, at least in the case of regular files on local UFS, since Solaris used the page cache for both cases, the results should have been consistent. Since zfs uses somewhat different mechanisms, does it still have the same consistency between write(2) and mmap(2) that was historically present (whether or not "guaranteed") when using UFS on Solaris? This message posted from opensolaris.org
Richard, Richard L. Hamilton wrote:>If I understand correctly, at least some systems claim not to guarantee >consistency between changes to a file via write(2) and changes via mmap(2). >But historically, at least in the case of regular files on local UFS, since Solaris >used the page cache for both cases, the results should have been consistent. > >Since zfs uses somewhat different mechanisms, does it still have the same >consistency between write(2) and mmap(2) that was historically present >(whether or not "guaranteed") when using UFS on Solaris? > >Yes, it does provide the same consistency. There is specific code to keep the page cache (needed in the case of mmapped files) and the ARC caches consistent. Thanks and regards, Sanjeev. -- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd. Tel: x27521 +91 80 669 27521
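For anyone who wants to confirm this behaviour on their own pool, a small test program along the following lines exercises both directions: data stored with write(2) should be visible immediately through an existing mapping, and a store through the mapping should be visible to a subsequent read(2). This is just an illustrative harness written for this thread (the file path is made up and error handling is minimal), not code from ZFS itself:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int
    main(void)
    {
            /* Hypothetical path on a ZFS filesystem -- adjust for your pool. */
            const char *path = "/tank/coherence_test";
            char buf[10];
            char *map;
            int fd;

            fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
            if (fd < 0) {
                    perror("open");
                    return (1);
            }
            /* Give the file one page of backing store so it can be mapped. */
            if (ftruncate(fd, 4096) != 0) {
                    perror("ftruncate");
                    return (1);
            }
            map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }

            /* Direction 1: store via write(2), observe through the mapping. */
            (void) pwrite(fd, "via-write", 9, 0);
            (void) printf("mapping sees: %.9s\n", map);

            /* Direction 2: store through the mapping, observe via read(2). */
            (void) memcpy(map, "via-mmap!", 9);
            (void) pread(fd, buf, 9, 0);
            (void) printf("read(2) sees: %.9s\n", buf);

            (void) munmap(map, 4096);
            (void) close(fd);
            (void) unlink(path);
            return (0);
    }

On UFS and, per Sanjeev''s answer, on ZFS as well, both printfs should show the most recently stored string.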