Hello,

For desktop use, and presumably rapidly changing non-desktop uses, I
find the ARC cache pretty annoying in its behavior. For example, this
morning I had to hit my launch-terminal key perhaps 50 times (roughly)
before it would start completing without disk I/O. There are plenty of
other examples as well, such as /var/db/pkg not being pulled
aggressively into cache, such that pkg_* operations (this is on
FreeBSD) are slower than they should be (I have to run "pkg_info" some
number of times before *it* will complete without disk I/O too).

I would be perfectly happy with pure LRU caching behavior, or an
approximation thereof, and would therefore like to essentially turn off
all MFU-like weighting completely. I have not investigated in great
depth, so it's possible this represents an implementation problem
rather than the actual intended policy of the ARC.

If the former, can someone confirm/deny? If the latter, is there some
way to tweak it? I have not found one (other than changing the code).
Is there any particular reason why such knobs are not exposed? Am I
missing something?

-- 
/ Peter Schuller
On Mon, 5 Apr 2010, Peter Schuller wrote:

> For desktop use, and presumably rapidly changing non-desktop uses, I
> find the ARC cache pretty annoying in its behavior. For example, this
> morning I had to hit my launch-terminal key perhaps 50 times (roughly)
> before it would start completing without disk I/O. There are plenty of
> other examples as well, such as /var/db/pkg not being pulled
> aggressively into cache, such that pkg_* operations (this is on
> FreeBSD) are slower than they should be (I have to run "pkg_info" some
> number of times before *it* will complete without disk I/O too).

It sounds like you are complaining about how FreeBSD has implemented zfs
in the system rather than about zfs in general. These problems don't
occur under Solaris. Zfs and the kernel need to agree on how to
allocate/free memory, and it seems that Solaris is more advanced than
FreeBSD in this area. It is my understanding that FreeBSD offers special
zfs tunables to adjust zfs memory usage.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> It sounds like you are complaining about how FreeBSD has implemented
> zfs in the system rather than about zfs in general. These problems
> don't occur under Solaris. Zfs and the kernel need to agree on how to
> allocate/free memory, and it seems that Solaris is more advanced than
> FreeBSD in this area. It is my understanding that FreeBSD offers
> special zfs tunables to adjust zfs memory usage.

It may be FreeBSD specific, but note that I am not talking about the
amount of memory dedicated to the ARC and how it balances with free
memory on the system. I am talking about eviction policy. I could be
wrong, but I didn't think the ZFS port made significant changes there.

And note that part of the *point* of the ARC (at least according to the
original paper, though it was a while since I read it), as opposed to a
pure LRU, is to do some weighting on frequency of access, which is
exactly consistent with what I'm observing (very quick eviction and/or
lack of insertion of data, particularly in the face of unrelated
long-term I/O having happened in the background). It would likely also
be the desired behavior for longer-running homogeneous disk access
patterns, where optimal use of the cache over a long period may be more
important than immediately reacting to a changing access pattern. So
it's not like there is no reason to believe this can be about ARC
policy.

Why would this *not* occur on Solaris? It seems to me that it would
imply the ARC was broken on Solaris, since it is not *supposed* to be a
pure LRU by design. Again, there may very well be a FreeBSD specific
issue here that is altering the behavior, and maybe the extremity of
what I am reporting is not supposed to be happening, but I believe the
issue is more involved than what you're implying in your response.

-- 
/ Peter Schuller
On Mon, 5 Apr 2010, Peter Schuller wrote:

> It may be FreeBSD specific, but note that I am not talking about the
> amount of memory dedicated to the ARC and how it balances with free
> memory on the system. I am talking about eviction policy. I could be
> wrong, but I didn't think the ZFS port made significant changes there.

The ARC is designed to use as much memory as is available up to a
limit. If the kernel allocator needs memory and there is none
available, then the allocator requests memory back from the zfs ARC.
Note that some systems have multiple memory allocators. For example,
there may be a memory allocator for the network stack, and/or for a
filesystem. The FreeBSD kernel is not the same as Solaris. While
Solaris uses a common allocator between most of the kernel and zfs,
FreeBSD may use different allocators, which are not able to share
memory. The space available for zfs might be pre-allocated.

I assume that you have already read the FreeBSD ZFS tuning guide
(http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem section
in the handbook
(http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made
sure that your system is tuned appropriately.

> Why would this *not* occur on Solaris? It seems to me that it would
> imply the ARC was broken on Solaris, since it is not *supposed* to be
> a pure LRU by design. Again, there may very well be a FreeBSD specific
> issue here that is altering the behavior, and maybe the extremity of
> what I am reporting is not supposed to be happening, but I believe the
> issue is more involved than what you're implying in your response.

There have been a lot of eyeballs looking at how zfs does its caching,
and a ton of benchmarks (mostly focusing on server throughput) to
verify the design. While there can certainly be zfs shortcomings (I
have found several), these are few and far between.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
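For completeness, the knobs that tuning guide covers are loader
tunables. A purely illustrative /boot/loader.conf fragment might look
like the following; the values are placeholders, not recommendations,
and should be sized to the machine at hand:

  # /boot/loader.conf -- illustrative values only
  vm.kmem_size="1536M"
  vm.kmem_size_max="1536M"
  vfs.zfs.arc_min="512M"
  vfs.zfs.arc_max="1024M"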
> The ARC is designed to use as much memory as is available up to a
> limit. If the kernel allocator needs memory and there is none
> available, then the allocator requests memory back from the zfs ARC.
> Note that some systems have multiple memory allocators. For example,
> there may be a memory allocator for the network stack, and/or for a
> filesystem.

Yes, but again I am concerned with what the ARC chooses to cache and
for how long, not how the ARC balances memory with other parts of the
kernel. At least, none of my observations lead me to believe the latter
is the problem here.

> might be pre-allocated. I assume that you have already read the
> FreeBSD ZFS tuning guide (http://wiki.freebsd.org/ZFSTuningGuide) and
> the ZFS filesystem section in the handbook
> (http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made
> sure that your system is tuned appropriately.

Yes, I have been tweaking and fiddling and reading off and on since ZFS
was originally added to CURRENT. This is not about tuning in that
sense. The fact that the little data necessary to start an 'urxvt'
instance does not get cached for at least 1-2 seconds on an otherwise
mostly idle system is either the result of cache policy, an
implementation bug (FreeBSD or otherwise), or a matter of an
*extremely* small cache size.

I have observed this behavior for a very long time, across versions of
both ZFS and FreeBSD, and with different forms of ARC sizing tweaks.
It's entirely possible there are FreeBSD issues preventing the ARC from
sizing itself appropriately. What I am saying, though, is that all
indications are that data is not being selected for caching at all, or
else is evicted extremely quickly, unless sufficient "frequency" has
been accumulated to, presumably, make the ARC decide to cache the data.
This is entirely what I would expect from a caching policy that tries
to adapt to long-term access patterns and avoid premature cache
eviction by looking at frequency of access.

I don't see what it is that is so outlandish about my query. These are
fundamental ways in which caches of different types behave, and there
is a legitimate reason not to use the same cache eviction policy under
all possible workloads. The behavior I am seeing is consistent with a
caching policy that tries "too hard" (for my particular use case) to
avoid eviction in the face of short-term changes in access pattern.

> There have been a lot of eyeballs looking at how zfs does its
> caching, and a ton of benchmarks (mostly focusing on server
> throughput) to verify the design. While there can certainly be zfs
> shortcomings (I have found several), these are few and far between.

That's a very general statement. I am talking about specifics here.
For example, you can have mountains of evidence that shows that a
plain LRU is "optimal" (under some conditions). That doesn't change
the fact that if I want to avoid a sequential scan of a huge data set
completely evicting everything in the cache, I cannot use a plain LRU.

In this case I'm looking for the reverse; i.e., increasing the
importance of 'recency', because my workload is such that it would be
more optimal than the behavior I am observing. Benchmarks are
irrelevant except insofar as they show that my problem is not with the
caching policy, since I am trying to address an empirically observed
behavior.

I *will* try to look at how the ARC sizes itself, as I'm unclear on
several things in the way memory is being reported by FreeBSD, but as
far as I can tell these are different issues. Sure, a bigger ARC might
hide the behavior I happen to see; but I want the cache to behave in a
way where I do not need gigabytes of extra ARC size to "lure" it into
caching the data necessary for 'urxvt' without having to start it 50
times in a row to accumulate statistics.

-- 
/ Peter Schuller
On Apr 5, 2010, at 2:23 PM, Peter Schuller wrote:

> That's a very general statement. I am talking about specifics here.
> For example, you can have mountains of evidence that shows that a
> plain LRU is "optimal" (under some conditions). That doesn't change
> the fact that if I want to avoid a sequential scan of a huge data set
> completely evicting everything in the cache, I cannot use a plain LRU.

In simple terms, the ARC is divided into a MRU and MFU side.

	target size (c) = target MRU size (p) + target MFU size (c-p)

On Solaris, to get from the MRU to the MFU side, the block must be read
at least once in 62.5 milliseconds. For pure read-once workloads, the
data won't move to the MFU side and the ARC will behave exactly like an
(adaptable) MRU cache.

> In this case I'm looking for the reverse; i.e., increasing the
> importance of 'recency', because my workload is such that it would be
> more optimal than the behavior I am observing. Benchmarks are
> irrelevant except insofar as they show that my problem is not with the
> caching policy, since I am trying to address an empirically observed
> behavior.
>
> I *will* try to look at how the ARC sizes itself, as I'm unclear on
> several things in the way memory is being reported by FreeBSD, but as
> far as I can tell these are different issues. Sure, a bigger ARC might
> hide the behavior I happen to see; but I want the cache to behave in a
> way where I do not need gigabytes of extra ARC size to "lure" it into
> caching the data necessary for 'urxvt' without having to start it 50
> times in a row to accumulate statistics.

I'm not convinced you have attributed the observation to the ARC
behaviour. Do you have dtrace (or other) data to explain what process
is causing the physical I/Os?
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
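For anyone following along, the MRU/MFU targets mentioned above can be
watched at runtime through the arcstats kstat on Solaris. A minimal
sketch (the zfs:0:arcstats module/instance/name is the commonly seen
layout, so treat it as an assumption about your system):

  # Current ARC size, total target (c) and MRU target (p), in bytes.
  # The MFU target is then c - p.
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:p

Watching p move up and down over time shows the "adaptive" part of the
ARC shifting space between the recency and frequency sides.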
> In simple terms, the ARC is divided into a MRU and MFU side.
>
> 	target size (c) = target MRU size (p) + target MFU size (c-p)
>
> On Solaris, to get from the MRU to the MFU side, the block must be
> read at least once in 62.5 milliseconds. For pure read-once workloads,
> the data won't move to the MFU side and the ARC will behave exactly
> like an (adaptable) MRU cache.

Ok. That differs significantly from my understanding, though in
retrospect I should have realized it given that the arc stats contain
only references to mru and mfu... I previously was under the impression
that the ZFS ARC had an LRU-ish side to complement the MFU side.
MRU+MFU changes things.

I will have to look into it in better detail to understand the
consequences. Is there a paper that describes the ARC as it is
implemented in ZFS (since it clearly diverges from the IBM ARC)?

>> I *will* try to look at how the ARC sizes itself, as I'm unclear on
>> several things in the way memory is being reported by FreeBSD, but as

For what it's worth, I confirmed that the ARC was too small and that
there are clearly remaining issues with the interaction between the ARC
and the rest of the FreeBSD kernel. (I wasn't sure before, but I
confirmed I was looking at the right number.) I'll try to monitor more
carefully and see if I can figure out when the ARC shrinks and why it
doesn't grow back. Informally, my observations have always been that
things behave great for a while after boot, but degenerate over time.

In this case it was sitting at its minimum size, which was 214M. I
realize this is far below what is recommended or even designed for, but
it is clearly caching *something*, and I clearly *could* make it cache
urxvt+deps by re-running it several tens of times in rapid succession.

> I'm not convinced you have attributed the observation to the ARC
> behaviour. Do you have dtrace (or other) data to explain what process
> is causing the physical I/Os?

In the urxvt case, I am basing my claim on informal observations. I.e.,
"hit terminal launch key, wait for disks to rattle, get my terminal".
Repeat. Only by repeating it very many times in very rapid succession
am I able to coerce it to be cached such that I can immediately get my
terminal. And what I mean by that is that it keeps necessitating disk
I/O for a long time, even on rapid successive invocations. But once I
have repeated it enough times it seems to finally enter the cache. (No
dtrace unfortunately. I confess to not having learned dtrace yet, in
spite of thinking it's massively cool.)

However, I will of course accept that, given the minimal ARC size at
the time, I am moving completely away from the designed-for use case.
And if that is responsible, it is of course my own fault.

Given MRU+MFU, I'll have to back off with my claims. Under the
(incorrect) assumption of LRU+MFU, I felt the behavior was unexpected,
even with a small cache size. Given MRU+MFU, and without knowing
further details right now, I accept that the ARC may fundamentally need
a bigger cache size in relation to the working set in order to be
effective in the way I am using it here. I was basing my expectations
on LRU-style behavior.

Thanks!

-- 
/ Peter Schuller
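For FreeBSD readers who want to reproduce the sizing check described
above, the ZFS port exports the same counters through sysctl. A sketch,
with the sysctl names assumed as exposed on FreeBSD 7/8-era systems
(adjust if they differ on yours):

  # Configured ARC bounds and the current size/targets:
  sysctl vfs.zfs.arc_min vfs.zfs.arc_max
  sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.p
  # Hit counters for the two lists, useful for seeing which side is doing the work:
  sysctl kstat.zfs.misc.arcstats.mru_hits kstat.zfs.misc.arcstats.mfu_hits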
On 04/05/10 15:24, Peter Schuller wrote:

> In the urxvt case, I am basing my claim on informal observations.
> I.e., "hit terminal launch key, wait for disks to rattle, get my
> terminal". Repeat. Only by repeating it very many times in very rapid
> succession am I able to coerce it to be cached such that I can
> immediately get my terminal. And what I mean by that is that it keeps
> necessitating disk I/O for a long time, even on rapid successive
> invocations. But once I have repeated it enough times it seems to
> finally enter the cache.

Are you sure you're not seeing unrelated disk update activity, like
atime updates, mtime updates on pseudo-terminals, etc.? I'd want to
start looking more closely at I/O traces (dtrace can be very helpful
here) before blaming any specific system component for the unexpected
I/O.

 - Bill
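As a starting point for the kind of trace suggested above, the standard
io provider one-liners attribute physical I/O to the issuing process.
This is a generic sketch (on FreeBSD of that era the io provider's
coverage may be incomplete, so treat it as illustrative rather than
definitive):

  # Count physical I/Os by issuing process; Ctrl-C prints the aggregation:
  dtrace -n 'io:::start { @[execname] = count(); }'
  # Same, broken down by file where the fileinfo translator can resolve it:
  dtrace -n 'io:::start { @[execname, args[2]->fi_pathname] = count(); }'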
On Apr 5, 2010, at 3:24 PM, Peter Schuller wrote:

> I will have to look into it in better detail to understand the
> consequences. Is there a paper that describes the ARC as it is
> implemented in ZFS (since it clearly diverges from the IBM ARC)?

There are various blogs, but perhaps the best documentation is in the
comments starting at
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#25
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Hi,

>> In simple terms, the ARC is divided into a MRU and MFU side.
>>
>> 	target size (c) = target MRU size (p) + target MFU size (c-p)
>>
>> On Solaris, to get from the MRU to the MFU side, the block must be
>> read at least once in 62.5 milliseconds. For pure read-once
>> workloads, the data won't move to the MFU side and the ARC will
>> behave exactly like an (adaptable) MRU cache.

Richard,

I am looking at the code that moves a buffer from MRU to MFU, and as I
read it, if the block is read and the time since the last access is
greater than 62 milliseconds, it moves from MRU to MFU (lines ~2256 to
~2265 in arc.c). Also, I have a program that reads the same block once
every 5 seconds, and on a relatively idle machine, I can find the block
in the MFU, not the MRU (using mdb). If the block is read again in less
than 62 milliseconds, it stays in the MRU.

max
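The test program above isn't shown, but something as simple as the
following loop approximates it (the file name is just a placeholder).
If that reading of arc.c is right, the buffer should end up on the MFU
list after the second pass, since the re-reads are spaced well beyond
62 ms:

  # Re-read the same first block of a file every 5 seconds:
  while true; do
      dd if=/some/test/file of=/dev/null bs=128k count=1 2>/dev/null
      sleep 5
  done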
I realized I forgot to follow up on this thread. Just to be clear, I
have confirmed that I am seeing what to me is undesirable behavior even
with the ARC being 1500 MB in size, on an almost idle system (< 0.5
MB/sec read load, almost zero write load). Observe these recursive
searches through /usr/src/sys:

% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats  2.74s user 1.19s system 20% cpu 19.143 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.51s system 99% cpu 2.986 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.41s user 0.62s system 53% cpu 5.667 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.37s user 0.68s system 50% cpu 6.025 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.61s system 45% cpu 6.694 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.45s user 0.59s system 53% cpu 5.651 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.32s user 0.72s system 46% cpu 6.503 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.41s user 0.66s system 44% cpu 6.843 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null > /dev/null  2.37s user 0.67s system 49% cpu 6.119 total

The first run was entirely cold. For some reason the second was close
to CPU-bound, while the remainder were significantly disk-bound, even
if not to the extent of the initial run. I correlated with 'iostat -x
1' to confirm that I am in fact generating I/O (but no, I do not have
dtrace output).

Anyway, presumably the answer to my original question is no, and the
above isn't really very interesting other than to show that under some
circumstances you can see behavior that is decidedly non-optimal for
interactive desktop use of certain kinds. Whether this is the ARC in
general or something FreeBSD specific, I don't know. But at this point
it does not have to do with ARC sizing, since the ARC is sensibly
large.

(I realize I should investigate properly and report back, but I'm not
likely to have time to dig into this now.)

-- 
/ Peter Schuller
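One low-effort way to tie runs like these to the ARC, short of dtrace,
would be to snapshot the arcstats hit/miss counters around each run and
compare. The sysctl names here assume FreeBSD's ZFS port:

  sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
  time ack arcstats 2>/dev/null 1>/dev/null
  sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
  # A large jump in "misses" across a warm run points at the ARC rather
  # than at unrelated background I/O such as atime updates.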