One thing I've been confused about for a long time is the relationship between ZFS, the ARC, and the page cache.

We have an application that's a quasi-database. It reads files by mmap()ing them. (Writes are done via write().) We're talking 100TB of data in files that are 100k->50G in size (the files have headers to tell the app what segment to map, so mapped chunks are in the 100k->50M range, though sometimes access is sequential).

I found it confusing that we ended up having to allocate a ton of swap to back anon pages behind all the mmap()ing. We never write to an mmap()ed space, so we don't ever write to swap, so it's not a huge deal, but it's curious.

In the old days of UFS, there was the page cache. You mmap()ed a file, it was allocated a range in your VM space, and the pager paged it in on demand via VFS/UFS. UFS had a block cache, but it was only about big enough to deal with queueing; the page cache was the main cache.

ZFS seems to break this model - the paging system and page cache are still there, but then there's this ARC layer (and L2ARC layer) underneath them. If I read the concept right, it works best in a world where users read()/write(): all the caching is done within the ARC, the page cache exists primarily for process heap, and the ARC expands and contracts as necessary to stay out of the way of process heap requirements, but otherwise caching happens in ARC space.

If I'm following this, what we're doing is essentially duplicating: files exist in the page cache, but they also exist in the ARC, and since the ARC is in kernel space, I presume the VM subsystem doesn't know that a page that happens to be in the page cache is actually in the ARC as well.

As a result of this line of thinking, I've tuned the box such that the ARC is relatively small (10G out of 96) and only caches metadata, with piles of L2ARC behind it - assuming that the page cache copy is the one I need, letting the pager decide what to keep in and out of RAM, and leaning on the I/O subsystem to make up for it.

(This sounds less terrible than you think - the machine has 90 dual-port SAS-2 spindles across 6 LSI controllers with 12 x4 uplinks off the expanders, no daisy-chain, with OCZ Vertex2Pro L2ARCs. I can push 5GByte/sec on/off disk all day without sweating hard.)

Is my line of thinking valid, or am I missing something?

Thanks,
-bacon
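(A purely illustrative sketch of mapping one segment of a large file read-only, roughly what an application like the one described above has to do. The offset/length arguments stand in for values the real application would pull from its file headers, which are not specified here; the only real point is that mmap() wants a page-aligned file offset, so the segment offset gets rounded down to a page boundary.)

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s file offset length\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        /* In the real application these would come from the file's header. */
        off_t  offset = (off_t)strtoll(argv[2], NULL, 0);
        size_t length = (size_t)strtoull(argv[3], NULL, 0);

        /* mmap() requires a page-aligned offset: round down, keep the slack. */
        long   pagesize = sysconf(_SC_PAGESIZE);
        off_t  aligned  = offset & ~((off_t)pagesize - 1);
        size_t slack    = (size_t)(offset - aligned);

        void *base = mmap(NULL, length + slack, PROT_READ, MAP_SHARED,
                          fd, aligned);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        const char *segment = (const char *)base + slack;
        (void)segment;          /* ... application reads the segment here ... */

        munmap(base, length + slack);
        close(fd);
        return 0;
    }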
Hi Jeff,

ZFS support for mmap() was something of an afterthought. The current Solaris virtual memory infrastructure didn't have the features or performance required, which is why ZFS ended up with the ARC.

Yes, you've got it. When we mmap() a ZFS file, there are two main caches involved: the ZFS ARC and the good old Solaris page cache. The reason for poor performance is the overhead of keeping the two caches in sync, but contention for RAM is also an issue.

I once did some work on reducing the overhead after I found that a Thumper could only deliver about 150MB/sec with a database engine based on mmap(), compared with about 500MB/sec running the same database on Linux on the same hardware...

  6699438 zfs induces crosscall storm under heavy mapped sequential read

This was fixed in OpenSolaris, and around Solaris 10 update 8, but it left quite a lot of performance on the table. I think this got the database performance up to about 300MB/sec (about 1/4 of what ZFS is capable of with non-mapped I/O).

Clamping the ARC is probably a good thing in your case, but it only addresses part of the problem.

Phil
Darren J Moffat
2010-Dec-21 14:31 UTC
[zfs-discuss] relationship between ARC and page cache
On 21/12/2010 14:25, Phil Harman wrote:
> Clamping the ARC is probably a good thing in your case, but it only
> addresses part of the problem.

Another alternative to try would be setting primarycache=metadata on the ZFS dataset that contains the mmap files. That way you are only turning off the ZFS ARC cache of the file content for that one dataset rather than clamping the ARC.

-- 
Darren J Moffat
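(Concretely - the dataset name below is invented for illustration - Darren's suggestion amounts to something like this, which keeps metadata in the ARC but stops caching file data there for just that dataset:)

    # zfs set primarycache=metadata tank/mmapdata
    # zfs get primarycache,secondarycache tank/mmapdata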
On Tue, Dec 21, 2010 at 7:58 AM, Jeff Bacon <bacon at walleyesoftware.com> wrote:
> I found it confusing that we ended up having to allocate a ton of swap
> to back anon pages behind all the mmap()ing. We never write to an
> mmap()ed space, so we don't ever write to swap, so it's not a huge deal,
> but it's curious.

Since others have already commented on the rest of this, I will note that you can use the MAP_NORESERVE flag with mmap() to prevent that behavior (which, as long as the mmap()ed data isn't being altered, shouldn't cause any issues).
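(A minimal sketch of that suggestion, assuming a page-aligned offset and a mapping that is never actually written to; MAP_NORESERVE tells the VM system not to set aside anon/swap space for the mapping's potential copy-on-write pages.)

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    /* Map a file segment privately without reserving swap.  A private,
     * writable mapping normally reserves backing store up front for its
     * potential copy-on-write pages; MAP_NORESERVE skips that reservation,
     * which is only safe if the mapping is never actually written to. */
    static void *map_noreserve(int fd, off_t offset, size_t len)
    {
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_NORESERVE, fd, offset);
    }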
> Another alternative to try would be setting primarycache=metadata on the
> ZFS dataset that contains the mmap files. That way you are only turning
> off the ZFS ARC cache of the file content for that one dataset rather
> than clamping the ARC.

Yeah, you'd think that would be the right thing to do. It's not. That resulted in throughput going through the floor, as it turns out the developers were occasionally using read() (I'm sorry, FileInputStream() - Java), which of course meant that every single 2-byte read resulted in a fetch from the L2ARC, since I'd disabled the primarycache for data blocks.

I'd go with MAP_NORESERVE, but that option isn't available on a Java MappedByteBuffer - most of which are opened R/O but occasionally get opened R/W. (Yes. Please try not to cringe. It does make sense, in context.) Fortunately, when you have 100TB to play with, a bit of disk allocated to swap that's never used is not all that much of a sacrifice.

So, to Phil's email - read()/write() on a ZFS-backed vnode somehow completely bypass the page cache and depend only on the ARC? How the heck does that happen - I thought all files were represented as VM objects?
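(In C terms - a purely illustrative sketch, since the application in question is Java - the access pattern described above is essentially the following; with primarycache=metadata the data blocks backing these tiny reads are never kept in the ARC, so every call pays an L2ARC or disk round trip.)

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Anti-pattern: tiny unbuffered reads.  With primarycache=metadata the
     * ARC holds no file data, so each call re-fetches the underlying record
     * from L2ARC (or disk) instead of hitting a RAM cache. */
    static void read_two_bytes_at_a_time(int fd)
    {
        uint16_t v;
        off_t    off = 0;

        while (pread(fd, &v, sizeof (v), off) == (ssize_t)sizeof (v))
            off += (off_t)sizeof (v);
    }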
On 21/12/2010 21:53, Jeff Bacon wrote:
> So, to Phil's email - read()/write() on a ZFS-backed vnode somehow
> completely bypass the page cache and depend only on the ARC? How the
> heck does that happen - I thought all files were represented as VM
> objects?

For most other filesystems (and oversimplifying rather) the vnode ops for read/write are implemented by mapping the file into the kernel address space (via the segmap segment) and doing bcopy() operations between user buffers and the page cache (which in the case of reads causes page faults leading to getpage vnode ops). If some form of direct I/O is in place, segmap (and therefore, the page cache) is bypassed.

In the case of ZFS the vnode ops for read/write call into the ZPL (ZFS POSIX Layer) unless the file is also mapped (in which case segmap has to be involved again, and the icky glue between the page cache and the ARC comes into play).
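(Very roughly - this is a non-runnable, heavily simplified sketch, with function names recalled from the OpenSolaris sources and therefore approximate; locking and error handling are omitted - the decision Phil describes looks something like this in the ZPL read path: if the vnode has pages in the page cache because someone mmap()ed the file, the read goes through those pages so they stay coherent with the ARC; otherwise it copies straight out of the ARC via the DMU.)

    /* Simplified sketch of the ZPL read path; names approximate. */
    static int
    zfs_read_sketch(vnode_t *vp, znode_t *zp, uio_t *uio)
    {
        int error = 0;

        while (uio->uio_resid > 0 && error == 0) {
            ssize_t nbytes = MIN(uio->uio_resid, zfs_read_chunk_size);

            if (vn_has_cached_data(vp)) {
                /* The file is (or was) mmap()ed: read via the page
                 * cache so mapped pages and the ARC stay in sync. */
                error = mappedread(vp, nbytes, uio);
            } else {
                /* Plain read(): copy straight out of the ARC via the
                 * DMU; the page cache is never touched. */
                error = dmu_read_uio(zp->z_zfsvfs->z_os, zp->z_id,
                    uio, nbytes);
            }
        }
        return (error);
    }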