Iwan Aucamp
2012-May-28 19:06 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
I'm getting sub-optimal performance with an mmap-based database (mongodb) which is running on ZFS on Solaris 10u9.

System is a Sun Fire X4270 M2 with 2x X5680 and 72GB (6 * 8GB + 6 * 4GB) RAM (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks.

- a few mongodb instances are running with moderate IO and a total RSS of 50 GB
- a service which logs quite excessively (5GB every 20 mins) is also running (max 2GB RAM use) - log files are compressed after some time to bzip2.

Database performance is quite horrid though - it seems that ZFS does not know how to manage allocation between the page cache and the ARC - and it seems the ARC wins most of the time.

I'm thinking of doing the following:
- relocating the mmapped (mongo) data to a ZFS filesystem with only metadata caching
- reducing the ZFS ARC to 16 GB

Are there any other recommendations - and is the above likely to improve performance?

--
Iwan Aucamp
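For concreteness, a minimal sketch of the two changes being considered, assuming a hypothetical dataset name tank/mongo for the mmapped data and the usual Solaris 10 route of capping the ARC via /etc/system:

    # Cache only metadata (not file data) in the ARC for the dataset that
    # will hold the mmapped mongo files (hypothetical dataset name)
    zfs create -o primarycache=metadata tank/mongo

    # Cap the ARC at 16 GB (0x400000000 bytes); takes effect after a reboot
    echo 'set zfs:zfs_arc_max = 0x400000000' >> /etc/system

    # After the reboot, confirm the ARC target and current size
    kstat -p zfs:0:arcstats:c_max zfs:0:arcstats:size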
Andrew Gabriel
2012-May-28 19:39 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 05/28/12 20:06, Iwan Aucamp wrote:
> I'm getting sub-optimal performance with an mmap based database
> (mongodb) which is running on zfs of Solaris 10u9.
>
> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB)
> ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
>
> - a few mongodb instances are running with moderate IO and total
> rss of 50 GB
> - a service which logs quite excessively (5GB every 20 mins) is also
> running (max 2GB ram use) - log files are compressed after some time
> to bzip2.
>
> Database performance is quite horrid though - it seems that zfs does
> not know how to manage allocation between page cache and arc cache -
> and it seems arc cache wins most of the time.
>
> I'm thinking of doing the following:
> - relocating mmaped (mongo) data to a zfs filesystem with only
> metadata cache
> - reducing zfs arc cache to 16 GB
>
> Is there any other recommendations - and is above likely to improve
> performance.

1. Upgrade to S10 Update 10 - this has various performance improvements, in particular related to database-type loads (but I don't know anything about mongodb).

2. Reduce the ARC size so RSS + ARC + other memory users < RAM size. I assume the RSS includes whatever caching the database does. In theory, a database should be able to work out what's worth caching better than any filesystem can guess from underneath it, so you want to configure more memory in the DB's cache than in the ARC. (The default ARC tuning is unsuitable for a database server.)

3. If the database has some concept of blocksize or recordsize that it uses to perform I/O, make sure the filesystems it is using are configured with the same recordsize. The ZFS default recordsize (128kB) is usually much bigger than database blocksizes. This is probably going to have less impact with an mmapped database than a read(2)/write(2) database, where it may prove better to match the filesystem's recordsize to the system's page size (4kB, unless it's using some type of large pages). I haven't tried playing with recordsize for memory-mapped I/O, so I'm speculating here.

Blocksize or recordsize may apply to the log file writer too, and it may be that this needs a different recordsize and therefore has to be in a different filesystem. If it uses write(2) or some variant rather than mmap(2) and doesn't document this in detail, DTrace is your friend.

4. Keep plenty of free space in the zpool if you want good database performance. If you're more than 60% full (S10U9) or 80% full (S10U10), that could be a factor.

Anyway, there are a few things to think about.

--
Andrew
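A rough illustration of points 3 and 4 plus the DTrace suggestion, assuming a hypothetical dataset tank/mongo; the 8K recordsize is purely an example - use whatever I/O size the database actually does:

    # Point 3: recordsize only affects files written after the change, so set
    # it before copying/loading the data files in (8K chosen only as an example)
    zfs set recordsize=8K tank/mongo

    # Point 4: check how full the pool is (CAP column)
    zpool list

    # See whether the log writer / database uses write(2) or mmap(2):
    # count those syscalls per process for 10 seconds
    dtrace -n 'syscall::write:entry,syscall::mmap:entry { @[execname, probefunc] = count(); } tick-10s { exit(0); }'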
Lionel Cons
2012-May-28 19:46 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On Mon, May 28, 2012 at 9:06 PM, Iwan Aucamp <aucampia at gmail.com> wrote:
> I'm getting sub-optimal performance with an mmap based database (mongodb)
> which is running on zfs of Solaris 10u9.
>
> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) ram
> (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
>
> - a few mongodb instances are running with moderate IO and total rss
> of 50 GB
> - a service which logs quite excessively (5GB every 20 mins) is also
> running (max 2GB ram use) - log files are compressed after some time to
> bzip2.
>
> Database performance is quite horrid though - it seems that zfs does not
> know how to manage allocation between page cache and arc cache - and it
> seems arc cache wins most of the time.
>
> I'm thinking of doing the following:
> - relocating mmaped (mongo) data to a zfs filesystem with only metadata
> cache
> - reducing zfs arc cache to 16 GB
>
> Is there any other recommendations - and is above likely to improve
> performance.

The only recommendation which will lead to results is to use a different OS or filesystem. Your choices are
- FreeBSD with ZFS
- Linux with BTRFS
- Solaris with QFS
- Solaris with UFS
- Solaris with NFSv4, using ZFS on independent fileserver machines

There's a rather mythical rewrite of the Solaris virtual memory subsystem called VM2 in progress, but it will still take a long time until it becomes available to customers, and there are no real data yet on whether it will help with mmap performance. It won't be available for OpenSolaris successors like Illumos either (likely never; at least the Illumos leadership doesn't see the need for it and instead recommends rewriting applications not to use mmap).

Lionel
Richard Elling
2012-May-28 20:10 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On May 28, 2012, at 12:46 PM, Lionel Cons wrote:
> On Mon, May 28, 2012 at 9:06 PM, Iwan Aucamp <aucampia at gmail.com> wrote:
>> I'm getting sub-optimal performance with an mmap based database (mongodb)
>> which is running on zfs of Solaris 10u9.
>>
>> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) ram
>> (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
>>
>> - a few mongodb instances are running with moderate IO and total rss
>> of 50 GB
>> - a service which logs quite excessively (5GB every 20 mins) is also
>> running (max 2GB ram use) - log files are compressed after some time to
>> bzip2.
>>
>> Database performance is quite horrid though - it seems that zfs does not
>> know how to manage allocation between page cache and arc cache - and it
>> seems arc cache wins most of the time.
>>
>> I'm thinking of doing the following:
>> - relocating mmaped (mongo) data to a zfs filesystem with only metadata
>> cache
>> - reducing zfs arc cache to 16 GB
>>
>> Is there any other recommendations - and is above likely to improve
>> performance.
>
> The only recommendation which will lead to results is to use a
> different OS or filesystem. Your choices are
> - FreeBSD with ZFS
> - Linux with BTRFS
> - Solaris with QFS
> - Solaris with UFS
> - Solaris with NFSv4, use ZFS on independent fileserver machines
>
> There's a rather mythical rewrite of the Solaris virtual memory
> subsystem called VM2 in progress but it will still take a long time
> until this will become available for customers and there are no real
> data yet whether this will help with mmap performance. It won't be
> available for Opensolaris successors like Illumos either (likely never,
> at least the Illumos leadership doesn't see the need for this and
> instead recommends to rewrite the applications to not use mmap).

This is a mischaracterization of the statements given. The illumos team says they will not implement Oracle's VM2 for valid, legal reasons. That does not mean that mmap performance improvements for ZFS cannot be implemented via other methods.

The primary concern for mmapped files is that the RAM footprint is doubled. If you do not manage this via limits, there can be a fight between the page cache and the ARC over a constrained RAM resource.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Iwan Aucamp
2012-May-28 20:25 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 05/28/2012 10:12 PM, Andrew Gabriel wrote:
> On 05/28/12 20:06, Iwan Aucamp wrote:
>> I'm thinking of doing the following:
>> - relocating mmaped (mongo) data to a zfs filesystem with only
>> metadata cache
>> - reducing zfs arc cache to 16 GB
>>
>> Is there any other recommendations - and is above likely to improve
>> performance.
>
> 1. Upgrade to S10 Update 10 - this has various performance improvements,
> in particular related to database type loads (but I don't know anything
> about mongodb).
>
> 2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
> I assume the RSS includes whatever caching the database does. In
> theory, a database should be able to work out what's worth caching
> better than any filesystem can guess from underneath it, so you want to
> configure more memory in the DB's cache than in the ARC. (The default
> ARC tuning is unsuitable for a database server.)
>
> 3. If the database has some concept of blocksize or recordsize that it
> uses to perform i/o, make sure the filesystems it is using are configured
> with the same recordsize. The ZFS default recordsize (128kB) is usually
> much bigger than database blocksizes. This is probably going to have
> less impact with an mmaped database than a read(2)/write(2) database,
> where it may prove better to match the filesystem's record size to the
> system's page size (4kB, unless it's using some type of large pages). I
> haven't tried playing with recordsize for memory mapped i/o, so I'm
> speculating here.
>
> Blocksize or recordsize may apply to the log file writer too, and it may
> be that this needs a different recordsize and therefore has to be in a
> different filesystem. If it uses write(2) or some variant rather than
> mmap(2) and doesn't document this in detail, Dtrace is your friend.
>
> 4. Keep plenty of free space in the zpool if you want good database
> performance. If you're more than 60% full (S10U9) or 80% full (S10U10),
> that could be a factor.
>
> Anyway, there are a few things to think about.

Thanks for the feedback. I cannot really do 1, but will look into points 3 and 4 - in addition to 2, which is what I hope to achieve with my second point. I would still like to know, though, whether it is recommended to do only metadata caching for mmapped files (the mongodb data files) - the way I see it, this should get rid of the double caching which is being done for mmapped files.
Richard Elling
2012-May-28 20:34 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
question below...

On May 28, 2012, at 1:25 PM, Iwan Aucamp wrote:
> On 05/28/2012 10:12 PM, Andrew Gabriel wrote:
>> On 05/28/12 20:06, Iwan Aucamp wrote:
>>> I'm thinking of doing the following:
>>> - relocating mmaped (mongo) data to a zfs filesystem with only
>>> metadata cache
>>> - reducing zfs arc cache to 16 GB
>>>
>>> Is there any other recommendations - and is above likely to improve
>>> performance.
>>
>> 1. Upgrade to S10 Update 10 - this has various performance improvements,
>> in particular related to database type loads (but I don't know anything
>> about mongodb).
>>
>> 2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
>> I assume the RSS includes whatever caching the database does. In
>> theory, a database should be able to work out what's worth caching
>> better than any filesystem can guess from underneath it, so you want to
>> configure more memory in the DB's cache than in the ARC. (The default
>> ARC tuning is unsuitable for a database server.)
>>
>> 3. If the database has some concept of blocksize or recordsize that it
>> uses to perform i/o, make sure the filesystems it is using are configured
>> with the same recordsize. The ZFS default recordsize (128kB) is usually
>> much bigger than database blocksizes. This is probably going to have
>> less impact with an mmaped database than a read(2)/write(2) database,
>> where it may prove better to match the filesystem's record size to the
>> system's page size (4kB, unless it's using some type of large pages). I
>> haven't tried playing with recordsize for memory mapped i/o, so I'm
>> speculating here.
>>
>> Blocksize or recordsize may apply to the log file writer too, and it may
>> be that this needs a different recordsize and therefore has to be in a
>> different filesystem. If it uses write(2) or some variant rather than
>> mmap(2) and doesn't document this in detail, Dtrace is your friend.
>>
>> 4. Keep plenty of free space in the zpool if you want good database
>> performance. If you're more than 60% full (S10U9) or 80% full (S10U10),
>> that could be a factor.
>>
>> Anyway, there are a few things to think about.
>
> Thanks for the feedback. I cannot really do 1, but will look into points 3
> and 4 - in addition to 2, which is what I hope to achieve with my second
> point. I would still like to know, though, whether it is recommended to do
> only metadata caching for mmapped files (the mongodb data files) - the way
> I see it, this should get rid of the double caching which is being done for
> mmapped files.

I'd be interested in the results of such tests. You can change the primarycache parameter on the fly, so you could test it in less time than it takes for me to type this email :-)
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
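For what it's worth, a quick on-the-fly test along those lines might look like this (hypothetical dataset name; blocks already in the ARC age out rather than being dropped instantly):

    # Switch the dataset holding the mmapped files to metadata-only caching
    zfs set primarycache=metadata tank/mongo
    zfs get primarycache tank/mongo

    # Watch the ARC size every 5 seconds while the workload runs
    kstat -p zfs:0:arcstats:size 5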
Lionel Cons
2012-May-28 21:18 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 28 May 2012 22:10, Richard Elling <richard.elling at gmail.com> wrote:
> The only recommendation which will lead to results is to use a
> different OS or filesystem. Your choices are
> - FreeBSD with ZFS
> - Linux with BTRFS
> - Solaris with QFS
> - Solaris with UFS
> - Solaris with NFSv4, use ZFS on independent fileserver machines
>
> There's a rather mythical rewrite of the Solaris virtual memory
> subsystem called VM2 in progress but it will still take a long time
> until this will become available for customers and there are no real
> data yet whether this will help with mmap performance. It won't be
> available for Opensolaris successors like Illumos either (likely never,
> at least the Illumos leadership doesn't see the need for this and
> instead recommends to rewrite the applications to not use mmap).
>
> This is a mischaracterization of the statements given. The illumos team
> says they will not implement Oracle's VM2 for valid, legal reasons.
> That does not mean that mmap performance improvements for ZFS
> cannot be implemented via other methods.

I'd like to hear what the other methods should be. The lack of mmap performance is only a symptom of a more severe disease. Just doing piecework and altering the VFS API to integrate ZFS/ARC/VM with each other doesn't fix the underlying problems.

I've assigned two of my staff, one familiar with the FreeBSD VM and one familiar with the Linux VM, to look at the current VM subsystem, and their preliminary reports point to disaster. If Illumos does not initiate a VM rewrite project of its own which will make the VM aware of NUMA, power management and other issues, then I predict nothing less than the downfall of Illumos within a couple of years, because the performance impact is dramatic and makes the Illumos kernel no longer competitive.

Despite these findings, of which Sun was aware for a long time, and the number of ex-Sun employees working on Illumos, I miss the commitment to launch such a project. That's why I said "likely never", unless of course someone slams Garrett's head with sufficient force on a wooden table to make him see the reality.

The reality is:
- The modern x86 server platforms are now all NUMA or NUMA-like. Lack of NUMA support leads to bad performance.
- They all use some kind of serialized link between CPU nodes, be it HyperTransport or QuickPath, with power management. If power management is active and has reduced the number of active links between nodes and the OS doesn't manage this correctly, you'll get bad performance. Illumos's VM isn't even remotely aware of this fact.
- Based on simulator testing we see that in a simulated environment with 8 sockets almost 40% of kernel memory accesses are _REMOTE_ accesses, i.e. not local to the node doing the access.

Those are all preliminary results; I expect that the remainder of the analysis will take another 4-5 weeks until we present the findings to the Illumos community. But I can already say it will be a faceslap for those who think that Illumos doesn't need a better VM system.

> The primary concern for mmap files is that the RAM footprint is doubled.

It's not only that RAM is doubled; the data are copied between both the ARC and the page cache multiple times. You can say memory and the in-memory copy operation are cheap, but this and the lack of NUMA awareness is a real performance killer.

Lionel
Richard Elling
2012-May-28 21:40 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
[Apologies to the list, this has expanded past ZFS; if someone complains, we can move the thread to another illumos dev list]

On May 28, 2012, at 2:18 PM, Lionel Cons wrote:
> On 28 May 2012 22:10, Richard Elling <richard.elling at gmail.com> wrote:
>> The only recommendation which will lead to results is to use a
>> different OS or filesystem. Your choices are
>> - FreeBSD with ZFS
>> - Linux with BTRFS
>> - Solaris with QFS
>> - Solaris with UFS
>> - Solaris with NFSv4, use ZFS on independent fileserver machines
>>
>> There's a rather mythical rewrite of the Solaris virtual memory
>> subsystem called VM2 in progress but it will still take a long time
>> until this will become available for customers and there are no real
>> data yet whether this will help with mmap performance. It won't be
>> available for Opensolaris successors like Illumos either (likely never,
>> at least the Illumos leadership doesn't see the need for this and
>> instead recommends to rewrite the applications to not use mmap).
>>
>> This is a mischaracterization of the statements given. The illumos team
>> says they will not implement Oracle's VM2 for valid, legal reasons.
>> That does not mean that mmap performance improvements for ZFS
>> cannot be implemented via other methods.
>
> I'd like to hear what the other methods should be. The lack of mmap
> performance is only a symptom of a more severe disease. Just doing
> piecework and altering the VFS API to integrate ZFS/ARC/VM with each
> other doesn't fix the underlying problems.
>
> I've assigned two of my staff, one familiar with the FreeBSD VM and
> one familiar with the Linux VM, to look at the current VM subsystem,
> and their preliminary reports point to disaster. If Illumos does not
> initiate a VM rewrite project of its own which will make the VM aware
> of NUMA, power management and other issues, then I predict nothing less
> than the downfall of Illumos within a couple of years, because the
> performance impact is dramatic and makes the Illumos kernel no longer
> competitive.
> Despite these findings, of which Sun was aware for a long time, and
> the number of ex-Sun employees working on Illumos, I miss the
> commitment to launch such a project. That's why I said "likely never",
> unless of course someone slams Garrett's head with sufficient force on
> a wooden table to make him see the reality.
>
> The reality is:
> - The modern x86 server platforms are now all NUMA or NUMA-like. Lack
> of NUMA support leads to bad performance.

SPARC has been NUMA since 1997 and Solaris changed the scheduler long ago.

> - They all use some kind of serialized link between CPU nodes, be it
> HyperTransport or QuickPath, with power management. If power
> management is active and has reduced the number of active links
> between nodes and the OS doesn't manage this correctly, you'll get bad
> performance. Illumos's VM isn't even remotely aware of this fact.
> - Based on simulator testing we see that in a simulated environment
> with 8 sockets almost 40% of kernel memory accesses are _REMOTE_
> accesses, i.e. not local to the node doing the access.
>
> Those are all preliminary results; I expect that the remainder of the
> analysis will take another 4-5 weeks until we present the findings to
> the Illumos community. But I can already say it will be a faceslap for
> those who think that Illumos doesn't need a better VM system.

Nobody said illumos doesn't need a better VM system. The statement was that illumos is not going to reverse-engineer Oracle's VM2.

>> The primary concern for mmap files is that the RAM footprint is doubled.
>
> It's not only that RAM is doubled; the data are copied between both
> the ARC and the page cache multiple times. You can say memory and the
> in-memory copy operation are cheap, but this and the lack of NUMA
> awareness is a real performance killer.

Anybody who has worked on a SPARC system for the past 15 years is well aware of NUMAness. We've been living in a NUMA world for a very long time, a world where the processors were slow and far memory latency is much, much worse than we see in the x86 world.

I look forward to seeing the results of your analysis and experiments.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Jim Klimov
2012-May-28 22:10 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
2012-05-29 0:34, Richard Elling wrote:
> I'd be interested in the results of such tests. You can change the
> primarycache parameter on the fly, so you could test it in less time
> than it takes for me to type this email :-)

I believe it would also take some time for the memory distribution to settle, expiring ARC data pages and actually claiming the RAM for the application... Right? ;)

//Jim
Daniel Carosone
2012-May-29 01:29 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On Mon, May 28, 2012 at 01:34:18PM -0700, Richard Elling wrote:
> I'd be interested in the results of such tests.

Me too, especially for databases like postgresql where there's a complementary cache-size tunable within the db that often needs to be turned up, since they implicitly rely on some filesystem caching as an L2.

That's where this gets tricky: L2ARC has the opportunity to make a big difference where the entire db won't all fit in memory (regardless of which subsystem has jurisdiction over that memory). If you exclude data from the ARC, you can't spill it to L2ARC.

For the mmap case: does the ARC keep a separate copy, or does the VM system map the same page into the process's address space? If a separate copy is made, that seems like a potential source of many kinds of problems - if it's the same page then the whole premise is essentially moot and there's no "double caching".

--
Dan.
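For reference, the per-dataset knobs in play here (pool and dataset names hypothetical): primarycache decides whether file data ever enters the ARC at all, and since an L2ARC device is fed from blocks being evicted out of the ARC, primarycache=metadata also keeps that data out of any cache device regardless of the secondarycache setting - which is the spill problem described above.

    # Does the pool have an L2ARC (cache) device at all?
    zpool status tank

    # Current caching policy for the dataset holding the database files
    zfs get primarycache,secondarycache tank/mongo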
Iwan Aucamp
2012-May-29 19:42 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 05/29/2012 03:29 AM, Daniel Carosone wrote:
> For the mmap case: does the ARC keep a separate copy, or does the vm
> system map the same page into the process's address space? If a
> separate copy is made, that seems like a potential source of many
> kinds of problems - if it's the same page then the whole premise is
> essentially moot and there's no "double caching".

As far as I understand it, for the mmap case the page cache is distinct from the ARC (i.e. the normal, simplified flow for reading from disk with mmap is DSK -> ARC -> page cache), and only the page cache gets mapped into the process's address space - which is what results in the double caching.

I have two other general questions regarding the page cache with ZFS + Solaris:
 - Does anything else except mmap still use the page cache?
 - Is there a parameter similar to /proc/sys/vm/swappiness that can control how long unused pages in the page cache stay in physical RAM if there is no shortage of physical RAM? And if not, how long will unused pages in the page cache stay in physical RAM, given there is no shortage of physical RAM?
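One way to watch how physical RAM is actually split on Solaris 10 is the ::memstat dcmd in mdb; a rough sketch (needs root, and the exact categories shown depend on the Solaris 10 update - newer ones break out "ZFS File Data" separately, while older ones fold the ARC into "Kernel"):

    # Summarize physical memory usage by category
    echo ::memstat | mdb -k

    # The ARC's own view of its current size, for comparison
    kstat -p zfs:0:arcstats:size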
Bob Friesenhahn
2012-May-31 01:55 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On Tue, 29 May 2012, Iwan Aucamp wrote:
> - Is there a parameter similar to /proc/sys/vm/swappiness that can control
> how long unused pages in the page cache stay in physical RAM if there is no
> shortage of physical RAM? And if not, how long will unused pages in the
> page cache stay in physical RAM, given there is no shortage of physical RAM?

Absent pressure for memory, no-longer-referenced pages will stay in memory forever. They can then be re-referenced in memory.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jeff Bacon
2012-Jun-01 12:27 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
> I'm getting sub-optimal performance with an mmap based database
> (mongodb) which is running on zfs of Solaris 10u9.
>
> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB)
> ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
>
> - a few mongodb instances are running with moderate IO and total
> rss of 50 GB
> - a service which logs quite excessively (5GB every 20 mins) is also
> running (max 2GB ram use) - log files are compressed after some
> time to bzip2.
>
> Database performance is quite horrid though - it seems that zfs does not
> know how to manage allocation between page cache and arc cache - and it
> seems arc cache wins most of the time.

Or to be more accurate, there is no coordination that I am aware of between the VM page cache and the ARC. Which, for all the glories of ZFS, strikes me as a *doh*face-in-palm* how-did-we-miss-this sort of thing. One of these days I need to ask Jeff and Bill what they were thinking.

We went through this 9 months ago - we wrote MongoDB, which attempted to mmap() whole database files for the purpose of skimming back and forth through them quickly (think column-oriented database). Performance, um, sucked.

There is a practical limit to the amount of RAM you can shove into a machine - and said RAM gets slower as you have to go to quad-rank DIMMs, which Nehalem can't run at full speed - for the sort of box you speak of, your top end at 1333MHz is 96G, last I checked. (We're at 192G in most cases.) So while copying the data around between VM and ARC is doable, in large quantities that are invariably going to blow the CPU L3, this may not be the most practical answer.

It didn't help of course that
a) said DB was implemented in Java - _please_ don't ask - which is hardly a poster child for implementing any form of mmap(), not to mention it spins a ton of threads
b) said machine _started_ with 72 2TB Constellations and a pack of Cheetahs arranged in 7 pools, resulting in ~700 additional kernel threads roaming around, all of which got woken up on any heavy disk access (yes, they could have all been in one pool - and yes, there is a specific reason for not doing so)

but and still. We managed to break ZFS as a result. There are a couple of cases filed. One is semi-patched, the other we're told simply can't be fixed in Solaris 10. Fortunately we understand the conditions that create the breakage, and work around it by Just Not Doing That(tm). In your configuration, I can almost guarantee you will not run into them.

> I'm thinking of doing the following:
> - relocating mmaped (mongo) data to a zfs filesystem with only
> metadata cache
> - reducing zfs arc cache to 16 GB
>
> Is there any other recommendations - and is above likely to improve
> performance.

Well... we ended up
(a) rewriting MongoDB to use in-process "buffer workspaces" and read()/write() to fill/dump the buffers to disk (essentially, giving up on mmap())
(b) moving most of the workload to CentOS and using the Solaris boxes as big fast NFSv3 fileservers (NFSv4 didn't work out so well for us) over 10G, because for most workloads it runs 5-8% faster on CentOS than on Solaris, and we're primarily a CentOS shop anyway so it was just easier for everyone to deal with - but this has little to do with mmap() difficulties

Given what I know of the Solaris VM, VFS and of ZFS as implemented - admittedly incomplete, and my VM knowledge is based mostly on SVR4 - it would seem to me that it is going to take some Really Creative Thinking to work around the mmap() problem - a tweak or two ain't gonna cut it.

-bacon
Jeff Bacon
2012-Jun-01 12:33 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
> I'd be interested in the results of such tests. You can change the
> primarycache parameter on the fly, so you could test it in less time
> than it takes for me to type this email :-)
> -- Richard

Tried that. Performance headed south like a cat with its tail on fire. We didn't bother quantifying; it was just that hideous.

(You know, us northern-hemisphere people always use "south" as a "down" direction. Is it different for people in the southern hemisphere? :) )

There are just too many _other_ little things running around a normal system for which NOT having primarycache is just too painful to contemplate (even with L2ARC) that, while I can envisage situations where one might want to do that, they're very, very few and far between.

-bacon
Jeff Bacon
2012-Jun-01 12:36 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
> Anybody who has worked on a SPARC system for the past 15 years is well
> aware of NUMAness. We've been living in a NUMA world for a very long time,
> a world where the processors were slow and far memory latency is much, much
> worse than we see in the x86 world.
>
> I look forward to seeing the results of your analysis and
> experiments.
> -- Richard

Like, um, seconded. Please. I'm very curious to learn of a "VM2" effort. (Sadly, I spend more time nowadays with my nose stuck into Cisco kit than into Solaris - well, not sadly, they're both interesting - but I'm out of touch with much of what's going on in the Solaris world anymore.)

It makes sense though. And perhaps it's well overdue. The basic notions of the VM subsystem haven't changed in what, 15 years? Ain't-broke-don't-fix, sure, but...

-bacon
Iwan Aucamp
2012-Jun-01 13:41 UTC
[zfs-discuss] Remedies for suboptimal mmap performance on zfs
On 06/01/2012 02:33 PM, Jeff Bacon wrote:
>> I'd be interested in the results of such tests. You can change the
>> primarycache parameter on the fly, so you could test it in less time
>> than it takes for me to type this email :-)
>> -- Richard
>
> Tried that. Performance headed south like a cat with its tail on fire.
> We didn't bother quantifying; it was just that hideous.
>
> (You know, us northern-hemisphere people always use "south" as a "down"
> direction. Is it different for people in the southern hemisphere? :) )
>
> There are just too many _other_ little things running around a normal
> system for which NOT having primarycache is just too painful to
> contemplate (even with L2ARC) that, while I can envisage situations
> where one might want to do that, they're very, very few and far between.

Thanks for the valuable feedback, Jeff, though I think you might have misunderstood: the idea is to make a ZFS filesystem just for the files being mmapped by mongo - that is, to disable the ARC only where double caching is involved (the mmapped files), leaving the ARC in place for the rest of the system and taking it out of the picture only for MongoDB's data.