One thing I've been confused about for a long time is the relationship between ZFS, the ARC, and the page cache.

We have an application that's a quasi-database. It reads files by mmap()ing them. (Writes are done via write().) We're talking 100TB of data in files that are 100k->50G in size (the files have headers to tell the app what segment to map, so mapped chunks are in the 100k->50M range, though sometimes access is sequential).

I found it confusing that we ended up having to allocate a ton of swap to back anon pages behind all the mmap()ing. We never write to an mmap()ed space, so we don't ever write to swap, so it's not a huge deal, but it's curious.

In the old days of UFS, there was the page cache. You mmap()ed a file, it was allocated a range in your VM space, and the pager paged it in on demand via VFS/UFS. UFS had a block cache, but it was only about big enough to deal with queueing; the page cache was the main cache.

ZFS seems to break this model - the paging system and page cache are still there, but then there's this ARC layer (and L2ARC layer) underneath them. If I read the concept right, it works best in a world where users read()/write(): all the caching is done within the ARC, the page cache exists primarily for process heap, and the ARC expands and contracts as necessary to stay out of the way of process heap requirements, but otherwise caching happens in ARC space.

If I'm following this, what we're doing is essentially duplicating: files exist in the page cache, but they also exist in the ARC, and since the ARC is in kernel space, I presume the VM subsystem doesn't know that a page that happens to be in the page cache is actually in the ARC as well.

As a result of this line of thinking, I've tuned the box such that the ARC is relatively small (10G out of 96) and only caches metadata, with piles of L2ARC behind it - assuming that the page cache copy is the one I need, letting the pager decide what to keep in and out of RAM, and leaning on the I/O subsystem to make up for it.

(This sounds less terrible than you think - the machine has 90 dual-port SAS-2 spindles across 6 LSI controllers with 12 x4 uplinks off the expanders, no daisy-chain, with OCZ Vertex2Pro L2ARCs. I can push 5GByte/sec on/off disk all day without sweating hard.)

Is my line of thinking valid, or am I missing something?

Thanks,
-bacon
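(A purely illustrative sketch of mapping one segment of a large file read-only, roughly what an application like the one described above has to do. The offset/length arguments stand in for values the real application would pull from its file headers, which are not specified here; the only real point is that mmap() wants a page-aligned file offset, so the segment offset gets rounded down to a page boundary.)

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s file offset length\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        /* In the real application these would come from the file's header. */
        off_t  offset = (off_t)strtoll(argv[2], NULL, 0);
        size_t length = (size_t)strtoull(argv[3], NULL, 0);

        /* mmap() requires a page-aligned offset: round down, keep the slack. */
        long   pagesize = sysconf(_SC_PAGESIZE);
        off_t  aligned  = offset & ~((off_t)pagesize - 1);
        size_t slack    = (size_t)(offset - aligned);

        void *base = mmap(NULL, length + slack, PROT_READ, MAP_SHARED,
                          fd, aligned);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        const char *segment = (const char *)base + slack;
        (void)segment;          /* ... application reads the segment here ... */

        munmap(base, length + slack);
        close(fd);
        return 0;
    }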
Hi Jeff,

ZFS support for mmap() was something of an afterthought. The current Solaris virtual memory infrastructure didn't have the features or performance required, which is why ZFS ended up with the ARC.

Yes, you've got it. When we mmap() a ZFS file, there are two main caches involved: the ZFS ARC and the good old Solaris page cache. The reason for poor performance is the overhead of keeping the two caches in sync, but contention for RAM is also an issue.

I once did some work on reducing the overhead after I found that a Thumper could only deliver about 150MB/sec with a database engine based on mmap(), compared with about 500MB/sec running the same database on Linux on the same hardware...

  6699438 zfs induces crosscall storm under heavy mapped sequential read

This was fixed in OpenSolaris, and around Solaris 10 update 8, but it left quite a lot of performance on the table. I think this got the database performance up to about 300MB/sec (about 1/4 of what ZFS is capable of with non-mapped I/O).

Clamping the ARC is probably a good thing in your case, but it only addresses part of the problem.

Phil
Darren J Moffat
2010-Dec-21 14:31 UTC
[zfs-discuss] relationship between ARC and page cache
On 21/12/2010 14:25, Phil Harman wrote:
> Clamping the ARC is probably a good thing in your case, but it only
> addresses part of the problem.

Another alternative to try would be setting primarycache=metadata on the ZFS dataset that contains the mmap files. That way you are only turning off the ZFS ARC cache of the file content for that one dataset rather than clamping the ARC.

-- 
Darren J Moffat
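(Concretely - the dataset name below is invented for illustration - Darren's suggestion amounts to something like this, which keeps metadata in the ARC but stops caching file data there for just that dataset:)

    # zfs set primarycache=metadata tank/mmapdata
    # zfs get primarycache,secondarycache tank/mmapdata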
On Tue, Dec 21, 2010 at 7:58 AM, Jeff Bacon <bacon at walleyesoftware.com> wrote:
> I found it confusing that we ended up having to allocate a ton of swap
> to back anon pages behind all the mmap()ing. We never write to an
> mmap()ed space, so we don't ever write to swap, so it's not a huge deal,
> but it's curious.

Since others have already commented on the rest of this, I will note that you can use the MAP_NORESERVE flag with mmap() to prevent that behavior (which, as long as the mmap()ed data isn't being altered, shouldn't cause any issues).
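(A minimal sketch of that suggestion, assuming a page-aligned offset and a mapping that is never actually written to; MAP_NORESERVE tells the VM system not to set aside anon/swap space for the mapping's potential copy-on-write pages.)

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    /* Map a file segment privately without reserving swap.  A private,
     * writable mapping normally reserves backing store up front for its
     * potential copy-on-write pages; MAP_NORESERVE skips that reservation,
     * which is only safe if the mapping is never actually written to. */
    static void *map_noreserve(int fd, off_t offset, size_t len)
    {
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_NORESERVE, fd, offset);
    }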
> Another alternative to try would be setting primarycache=metadata on the
> ZFS dataset that contains the mmap files. That way you are only turning
> off the ZFS ARC cache of the file content for that one dataset rather
> than clamping the ARC.

Yeah, you'd think that would be the right thing to do. It's not. That resulted in throughput going through the floor, as it turns out the developers were occasionally using read() (I'm sorry, FileInputStream() - Java), which of course meant that every single 2-byte read resulted in a fetch from the L2ARC, since I'd disabled the primarycache for data blocks.

I'd go with MAP_NORESERVE, but that option isn't available on a Java MappedByteBuffer - most of which are opened R/O but occasionally get opened R/W. (Yes. Please try not to cringe. It does make sense, in context.) Fortunately, when you have 100TB to play with, a bit of disk allocated to swap that's never used is not all that much of a sacrifice.

So, to Phil's email - read()/write() on a ZFS-backed vnode somehow completely bypass the page cache and depend only on the ARC? How the heck does that happen - I thought all files were represented as VM objects?
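(In C terms - a purely illustrative sketch, since the application in question is Java - the access pattern described above is essentially the following; with primarycache=metadata the data blocks backing these tiny reads are never kept in the ARC, so every call pays an L2ARC or disk round trip.)

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Anti-pattern: tiny unbuffered reads.  With primarycache=metadata the
     * ARC holds no file data, so each call re-fetches the underlying record
     * from L2ARC (or disk) instead of hitting a RAM cache. */
    static void read_two_bytes_at_a_time(int fd)
    {
        uint16_t v;
        off_t    off = 0;

        while (pread(fd, &v, sizeof (v), off) == (ssize_t)sizeof (v))
            off += (off_t)sizeof (v);
    }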
On 21/12/2010 21:53, Jeff Bacon wrote:
> So, to Phil's email - read()/write() on a ZFS-backed vnode somehow
> completely bypass the page cache and depend only on the ARC? How the
> heck does that happen - I thought all files were represented as VM
> objects?

For most other filesystems (and oversimplifying rather) the vnode ops for read/write are implemented by mapping the file into the kernel address space (via the segmap segment) and doing bcopy() operations between user buffers and the page cache (which in the case of reads causes page faults leading to getpage vnode ops). If some form of direct I/O is in place, segmap (and therefore, the page cache) is bypassed.

In the case of ZFS the vnode ops for read/write call into the ZPL (ZFS POSIX Layer) unless the file is also mapped (in which case segmap has to be involved again, and the icky glue between the page cache and the ARC comes into play).
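(Very roughly - this is a non-runnable, heavily simplified sketch, with function names recalled from the OpenSolaris sources and therefore approximate; locking and error handling are omitted - the decision Phil describes looks something like this in the ZPL read path: if the vnode has pages in the page cache because someone mmap()ed the file, the read goes through those pages so they stay coherent with the ARC; otherwise it copies straight out of the ARC via the DMU.)

    /* Simplified sketch of the ZPL read path; names approximate. */
    static int
    zfs_read_sketch(vnode_t *vp, znode_t *zp, uio_t *uio)
    {
        int error = 0;

        while (uio->uio_resid > 0 && error == 0) {
            ssize_t nbytes = MIN(uio->uio_resid, zfs_read_chunk_size);

            if (vn_has_cached_data(vp)) {
                /* The file is (or was) mmap()ed: read via the page
                 * cache so mapped pages and the ARC stay in sync. */
                error = mappedread(vp, nbytes, uio);
            } else {
                /* Plain read(): copy straight out of the ARC via the
                 * DMU; the page cache is never touched. */
                error = dmu_read_uio(zp->z_zfsvfs->z_os, zp->z_id,
                    uio, nbytes);
            }
        }
        return (error);
    }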