Hello,

For desktop use, and presumably rapidly changing non-desktop uses, I
find the ARC cache pretty annoying in its behavior. For example, this
morning I had to hit my launch-terminal key perhaps 50 times (roughly)
before it would start completing without disk I/O. There are plenty of
other examples as well, such as /var/db/pkg not being pulled
aggressively into cache, such that pkg_* operations (this is on
FreeBSD) are slower than they should be (I have to run "pkg_info" some
number of times before *it* will complete without disk I/O too).

I would be perfectly happy with pure LRU caching behavior, or an
approximation thereof, and would therefore like to essentially turn off
all MFU-like weighting completely. I have not investigated in great
depth, so it's possible this represents an implementation problem
rather than the actual intended policy of the ARC.

If the former, can someone confirm/deny? If the latter, is there some
way to tweak it? I have not found one (other than changing the code).
Is there any particular reason why such knobs are not exposed? Am I
missing something?

-- 
/ Peter Schuller
On Mon, 5 Apr 2010, Peter Schuller wrote:

> For desktop use, and presumably rapidly changing non-desktop uses, I
> find the ARC cache pretty annoying in its behavior. For example, this
> morning I had to hit my launch-terminal key perhaps 50 times (roughly)
> before it would start completing without disk I/O. There are plenty of
> other examples as well, such as /var/db/pkg not being pulled
> aggressively into cache, such that pkg_* operations (this is on
> FreeBSD) are slower than they should be (I have to run "pkg_info" some
> number of times before *it* will complete without disk I/O too).

It sounds like you are complaining about how FreeBSD has implemented zfs
in the system rather than about zfs in general. These problems don't
occur under Solaris. Zfs and the kernel need to agree on how to
allocate/free memory, and it seems that Solaris is more advanced than
FreeBSD in this area. It is my understanding that FreeBSD offers special
zfs tunables to adjust zfs memory usage.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> It sounds like you are complaining about how FreeBSD has implemented
> zfs in the system rather than about zfs in general. These problems
> don't occur under Solaris. Zfs and the kernel need to agree on how to
> allocate/free memory, and it seems that Solaris is more advanced than
> FreeBSD in this area. It is my understanding that FreeBSD offers
> special zfs tunables to adjust zfs memory usage.

It may be FreeBSD specific, but note that I am not talking about the
amount of memory dedicated to the ARC and how it balances with free
memory on the system. I am talking about eviction policy. I could be
wrong, but I didn't think the ZFS port made significant changes there.

And note that part of the *point* of the ARC (at least according to the
original paper, though it was a while since I read it), as opposed to a
pure LRU, is to do some weighting on frequency of access, which is
exactly consistent with what I'm observing (very quick eviction and/or
lack of insertion of data, particularly in the face of unrelated
long-term I/O having happened in the background). It would likely also
be the desired behavior for longer-running homogeneous disk access
patterns, where optimal use of the cache over a long period may be more
important than immediately reacting to a changing access pattern. So
it's not like there is no reason to believe this can be about ARC
policy.

Why would this *not* occur on Solaris? It seems to me that it would
imply the ARC was broken on Solaris, since it is not *supposed* to be a
pure LRU by design. Again, there may very well be a FreeBSD specific
issue here that is altering the behavior, and maybe the extremity of
what I am reporting is not supposed to be happening, but I believe the
issue is more involved than what you're implying in your response.

-- 
/ Peter Schuller
On Mon, 5 Apr 2010, Peter Schuller wrote:

> It may be FreeBSD specific, but note that I am not talking about the
> amount of memory dedicated to the ARC and how it balances with free
> memory on the system. I am talking about eviction policy. I could be
> wrong, but I didn't think the ZFS port made significant changes there.

The ARC is designed to use as much memory as is available up to a
limit. If the kernel allocator needs memory and there is none
available, then the allocator requests memory back from the zfs ARC.
Note that some systems have multiple memory allocators. For example,
there may be a memory allocator for the network stack, and/or for a
filesystem. The FreeBSD kernel is not the same as Solaris. While
Solaris uses a common allocator between most of the kernel and zfs,
FreeBSD may use different allocators, which are not able to share
memory. The space available for zfs might be pre-allocated.

I assume that you have already read the FreeBSD ZFS tuning guide
(http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem section
in the handbook
(http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made
sure that your system is tuned appropriately.

> Why would this *not* occur on Solaris? It seems to me that it would
> imply the ARC was broken on Solaris, since it is not *supposed* to be
> a pure LRU by design. Again, there may very well be a FreeBSD specific
> issue here that is altering the behavior, and maybe the extremity of
> what I am reporting is not supposed to be happening, but I believe the
> issue is more involved than what you're implying in your response.

There have been a lot of eyeballs looking at how zfs does its caching,
and a ton of benchmarks (mostly focusing on server throughput) to
verify the design. While there can certainly be zfs shortcomings (I
have found several), these are few and far between.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
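For completeness, the knobs that tuning guide covers are loader
tunables. A purely illustrative /boot/loader.conf fragment might look
like the following; the values are placeholders, not recommendations,
and should be sized to the machine at hand:

  # /boot/loader.conf -- illustrative values only
  vm.kmem_size="1536M"
  vm.kmem_size_max="1536M"
  vfs.zfs.arc_min="512M"
  vfs.zfs.arc_max="1024M"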
> The ARC is designed to use as much memory as is available up to a
> limit. If the kernel allocator needs memory and there is none
> available, then the allocator requests memory back from the zfs ARC.
> Note that some systems have multiple memory allocators. For example,
> there may be a memory allocator for the network stack, and/or for a
> filesystem.

Yes, but again I am concerned with what the ARC chooses to cache and
for how long, not how the ARC balances memory with other parts of the
kernel. At least, none of my observations lead me to believe the latter
is the problem here.

> might be pre-allocated. I assume that you have already read the
> FreeBSD ZFS tuning guide (http://wiki.freebsd.org/ZFSTuningGuide) and
> the ZFS filesystem section in the handbook
> (http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made
> sure that your system is tuned appropriately.

Yes, I have been tweaking and fiddling and reading off and on since ZFS
was originally added to CURRENT. This is not about tuning in that
sense. The fact that the little data necessary to start an 'urxvt'
instance does not get cached for at least 1-2 seconds on an otherwise
mostly idle system is either the result of cache policy, an
implementation bug (FreeBSD or otherwise), or a matter of an
*extremely* small cache size.

I have observed this behavior for a very long time, across versions of
both ZFS and FreeBSD, and with different forms of ARC sizing tweaks.
It's entirely possible there are FreeBSD issues preventing the ARC from
sizing itself appropriately. What I am saying, though, is that all
indications are that data is not being selected for caching at all, or
else is evicted extremely quickly, unless sufficient "frequency" has
been accumulated to, presumably, make the ARC decide to cache the data.
This is entirely what I would expect from a caching policy that tries
to adapt to long-term access patterns and avoid premature cache
eviction by looking at frequency of access.

I don't see what it is that is so outlandish about my query. These are
fundamental ways in which caches of different types behave, and there
is a legitimate reason not to use the same cache eviction policy under
all possible workloads. The behavior I am seeing is consistent with a
caching policy that tries "too hard" (for my particular use case) to
avoid eviction in the face of short-term changes in access pattern.

> There have been a lot of eyeballs looking at how zfs does its
> caching, and a ton of benchmarks (mostly focusing on server
> throughput) to verify the design. While there can certainly be zfs
> shortcomings (I have found several), these are few and far between.

That's a very general statement. I am talking about specifics here.
For example, you can have mountains of evidence that shows that a
plain LRU is "optimal" (under some conditions). That doesn't change
the fact that if I want to avoid a sequential scan of a huge data set
completely evicting everything in the cache, I cannot use a plain LRU.

In this case I'm looking for the reverse; i.e., increasing the
importance of 'recency', because my workload is such that it would be
more optimal than the behavior I am observing. Benchmarks are
irrelevant except insofar as they show that my problem is not with the
caching policy, since I am trying to address an empirically observed
behavior.

I *will* try to look at how the ARC sizes itself, as I'm unclear on
several things in the way memory is being reported by FreeBSD, but as
far as I can tell these are different issues. Sure, a bigger ARC might
hide the behavior I happen to see; but I want the cache to behave in a
way where I do not need gigabytes of extra ARC size to "lure" it into
caching the data necessary for 'urxvt' without having to start it 50
times in a row to accumulate statistics.

-- 
/ Peter Schuller
On Apr 5, 2010, at 2:23 PM, Peter Schuller wrote:

> That's a very general statement. I am talking about specifics here.
> For example, you can have mountains of evidence that shows that a
> plain LRU is "optimal" (under some conditions). That doesn't change
> the fact that if I want to avoid a sequential scan of a huge data set
> completely evicting everything in the cache, I cannot use a plain LRU.

In simple terms, the ARC is divided into a MRU and MFU side.

	target size (c) = target MRU size (p) + target MFU size (c-p)

On Solaris, to get from the MRU to the MFU side, the block must be read
at least once in 62.5 milliseconds. For pure read-once workloads, the
data won't move to the MFU side and the ARC will behave exactly like an
(adaptable) MRU cache.

> In this case I'm looking for the reverse; i.e., increasing the
> importance of 'recency', because my workload is such that it would be
> more optimal than the behavior I am observing. Benchmarks are
> irrelevant except insofar as they show that my problem is not with the
> caching policy, since I am trying to address an empirically observed
> behavior.
>
> I *will* try to look at how the ARC sizes itself, as I'm unclear on
> several things in the way memory is being reported by FreeBSD, but as
> far as I can tell these are different issues. Sure, a bigger ARC might
> hide the behavior I happen to see; but I want the cache to behave in a
> way where I do not need gigabytes of extra ARC size to "lure" it into
> caching the data necessary for 'urxvt' without having to start it 50
> times in a row to accumulate statistics.

I'm not convinced you have attributed the observation to the ARC
behaviour. Do you have dtrace (or other) data to explain what process
is causing the physical I/Os?
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
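For anyone following along, the MRU/MFU targets mentioned above can be
watched at runtime through the arcstats kstat on Solaris. A minimal
sketch (the zfs:0:arcstats module/instance/name is the commonly seen
layout, so treat it as an assumption about your system):

  # Current ARC size, total target (c) and MRU target (p), in bytes.
  # The MFU target is then c - p.
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:p

Watching p move up and down over time shows the "adaptive" part of the
ARC shifting space between the recency and frequency sides.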
> In simple terms, the ARC is divided into a MRU and MFU side.
>
> 	target size (c) = target MRU size (p) + target MFU size (c-p)
>
> On Solaris, to get from the MRU to the MFU side, the block must be
> read at least once in 62.5 milliseconds. For pure read-once workloads,
> the data won't move to the MFU side and the ARC will behave exactly
> like an (adaptable) MRU cache.

Ok. That differs significantly from my understanding, though in
retrospect I should have realized it given that the arc stats contain
only references to mru and mfu... I previously was under the impression
that the ZFS ARC had an LRU-ish side to complement the MFU side.
MRU+MFU changes things.

I will have to look into it in better detail to understand the
consequences. Is there a paper that describes the ARC as it is
implemented in ZFS (since it clearly diverges from the IBM ARC)?

>> I *will* try to look at how the ARC sizes itself, as I'm unclear on
>> several things in the way memory is being reported by FreeBSD, but as

For what it's worth, I confirmed that the ARC was too small and that
there are clearly remaining issues with the interaction between the ARC
and the rest of the FreeBSD kernel. (I wasn't sure before, but I
confirmed I was looking at the right number.) I'll try to monitor more
carefully and see if I can figure out when the ARC shrinks and why it
doesn't grow back. Informally, my observations have always been that
things behave great for a while after boot, but degenerate over time.

In this case it was sitting at its minimum size, which was 214M. I
realize this is far below what is recommended or even designed for, but
it is clearly caching *something*, and I clearly *could* make it cache
urxvt+deps by re-running it several tens of times in rapid succession.

> I'm not convinced you have attributed the observation to the ARC
> behaviour. Do you have dtrace (or other) data to explain what process
> is causing the physical I/Os?

In the urxvt case, I am basing my claim on informal observations. I.e.,
"hit terminal launch key, wait for disks to rattle, get my terminal".
Repeat. Only by repeating it very many times in very rapid succession
am I able to coerce it to be cached such that I can immediately get my
terminal. And what I mean by that is that it keeps necessitating disk
I/O for a long time, even on rapid successive invocations. But once I
have repeated it enough times it seems to finally enter the cache. (No
dtrace unfortunately. I confess to not having learned dtrace yet, in
spite of thinking it's massively cool.)

However, I will of course accept that, given the minimal ARC size at
the time, I am moving completely away from the designed-for use case.
And if that is responsible, it is of course my own fault.

Given MRU+MFU, I'll have to back off with my claims. Under the
(incorrect) assumption of LRU+MFU, I felt the behavior was unexpected,
even with a small cache size. Given MRU+MFU, and without knowing
further details right now, I accept that the ARC may fundamentally need
a bigger cache size in relation to the working set in order to be
effective in the way I am using it here. I was basing my expectations
on LRU-style behavior.

Thanks!

-- 
/ Peter Schuller
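For FreeBSD readers who want to reproduce the sizing check described
above, the ZFS port exports the same counters through sysctl. A sketch,
with the sysctl names assumed as exposed on FreeBSD 7/8-era systems
(adjust if they differ on yours):

  # Configured ARC bounds and the current size/targets:
  sysctl vfs.zfs.arc_min vfs.zfs.arc_max
  sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.p
  # Hit counters for the two lists, useful for seeing which side is doing the work:
  sysctl kstat.zfs.misc.arcstats.mru_hits kstat.zfs.misc.arcstats.mfu_hits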
On 04/05/10 15:24, Peter Schuller wrote:

> In the urxvt case, I am basing my claim on informal observations.
> I.e., "hit terminal launch key, wait for disks to rattle, get my
> terminal". Repeat. Only by repeating it very many times in very rapid
> succession am I able to coerce it to be cached such that I can
> immediately get my terminal. And what I mean by that is that it keeps
> necessitating disk I/O for a long time, even on rapid successive
> invocations. But once I have repeated it enough times it seems to
> finally enter the cache.

Are you sure you're not seeing unrelated disk update activity, like
atime updates, mtime updates on pseudo-terminals, etc.? I'd want to
start looking more closely at I/O traces (dtrace can be very helpful
here) before blaming any specific system component for the unexpected
I/O.

 - Bill
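As a starting point for the kind of trace suggested above, the standard
io provider one-liners attribute physical I/O to the issuing process.
This is a generic sketch (on FreeBSD of that era the io provider's
coverage may be incomplete, so treat it as illustrative rather than
definitive):

  # Count physical I/Os by issuing process; Ctrl-C prints the aggregation:
  dtrace -n 'io:::start { @[execname] = count(); }'
  # Same, broken down by file where the fileinfo translator can resolve it:
  dtrace -n 'io:::start { @[execname, args[2]->fi_pathname] = count(); }'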
On Apr 5, 2010, at 3:24 PM, Peter Schuller wrote:

> I will have to look into it in better detail to understand the
> consequences. Is there a paper that describes the ARC as it is
> implemented in ZFS (since it clearly diverges from the IBM ARC)?

There are various blogs, but perhaps the best documentation is in the
comments starting at
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#25
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Hi,

>> In simple terms, the ARC is divided into a MRU and MFU side.
>>
>> 	target size (c) = target MRU size (p) + target MFU size (c-p)
>>
>> On Solaris, to get from the MRU to the MFU side, the block must be
>> read at least once in 62.5 milliseconds. For pure read-once
>> workloads, the data won't move to the MFU side and the ARC will
>> behave exactly like an (adaptable) MRU cache.

Richard,

I am looking at the code that moves a buffer from MRU to MFU, and as I
read it, if the block is read and the time since the last access is
greater than 62 milliseconds, it moves from MRU to MFU (lines ~2256 to
~2265 in arc.c). Also, I have a program that reads the same block once
every 5 seconds, and on a relatively idle machine, I can find the block
in the MFU, not the MRU (using mdb). If the block is read again in less
than 62 milliseconds, it stays in the MRU.

max
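The test program above isn't shown, but something as simple as the
following loop approximates it (the file name is just a placeholder).
If that reading of arc.c is right, the buffer should end up on the MFU
list after the second pass, since the re-reads are spaced well beyond
62 ms:

  # Re-read the same first block of a file every 5 seconds:
  while true; do
      dd if=/some/test/file of=/dev/null bs=128k count=1 2>/dev/null
      sleep 5
  done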
I realized I forgot to follow up on this thread. Just to be clear, I
have confirmed that I am seeing what to me is undesirable behavior even
with the ARC being 1500 MB in size, on an almost idle system (< 0.5
MB/sec read load, almost zero write load). Observe these recursive
searches through /usr/src/sys:

% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats  2.74s user 1.19s system 20% cpu 19.143 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.51s system 99% cpu 2.986 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.41s user 0.62s system 53% cpu 5.667 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.37s user 0.68s system 50% cpu 6.025 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.61s system 45% cpu 6.694 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.59s system 53% cpu 5.651 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.32s user 0.72s system 46% cpu 6.503 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.41s user 0.66s system 44% cpu 6.843 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.37s user 0.67s system 49% cpu 6.119 total

The first run was entirely cold. For some reason the second was close
to CPU-bound, while the remainder were significantly disk-bound, even
if not to the extent of the initial run. I correlated with 'iostat -x
1' to confirm that I am in fact generating I/O (but no, I do not have
dtrace output).

Anyway, presumably the answer to my original question is no, and the
above isn't really very interesting other than to show that under some
circumstances you can see behavior that is decidedly non-optimal for
interactive desktop use of certain kinds. Whether this is the ARC in
general or something FreeBSD specific, I don't know. But at this point
it does not have to do with ARC sizing, since the ARC is sensibly
large.

(I realize I should investigate properly and report back, but I'm not
likely to have time to dig into this now.)

-- 
/ Peter Schuller
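One low-effort way to tie runs like these to the ARC, short of dtrace,
would be to snapshot the arcstats hit/miss counters around each run and
compare. The sysctl names here assume FreeBSD's ZFS port:

  sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
  time ack arcstats 2>/dev/null 1>/dev/null
  sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
  # A large jump in "misses" across a warm run points at the ARC rather
  # than at unrelated background I/O such as atime updates.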