Ricardo Correia
2007-Jan-10 00:39 UTC
[zfs-code] Request for help and advice on cache behaviour
Hi,

I'm not sure how to control the ARC on the ZFS port to FUSE.

In the alpha1 release, for testing, I simply set the zfs_arc_max and zfs_arc_min variables to 80 MB and 64 MB (respectively) to prevent the ARC from growing unboundedly.

However, I'm having a problem. A simple run of the following script will cause zfs-fuse memory usage to grow almost indefinitely:

for i in `seq 1 100000`;
do
        touch /pool/testdir/$i
done

The problem seems to be that vnodes are getting allocated and never freed.

From what I understand, and from what I read in the previous thread about a similar issue that Pawel was having, this is what happens in Solaris (and in zfs-fuse, by extension):

1) When VN_RELE() is called and vp->v_count reaches 1, VOP_INACTIVE() is called.
2) VOP_INACTIVE() calls zfs_inactive(), which calls zfs_zinactive().
3) zfs_zinactive() calls dmu_buf_rele().
4) ??
5) znode_pageout_func() calls zfs_znode_free(), which finally frees the vnode.

As for step 4, Mark Maybee mentioned:

"Note that the db_immediate_evict == 0 means that you will probably *not* see a callback to the pageout function immediately. This is the general case. We hold onto the znode (and related memory) until the associated disk blocks are evicted from the cache (arc). The cache is likely to hold onto that data until either:
 - we encounter memory shortage, and so reduce the cache size
 - we read new data into the cache, and evict this data to make space for it."

So even if I have a "not very big" cache, there can be a lot of alloc'ed vnodes which consume a lot more memory! Of course, if the ARC would somehow take that memory into account when checking zfs_arc_max, it would be easier to tune it.

So, the better question is: is the ARC even helpful for a FUSE filesystem? I mean, the Linux kernel is already caching file data, even for FUSE filesystems. Maybe I should try to disable the ARC? Or set it to a very small maximum size?
Or should I have a thread that monitors memory usage and calls arc_kmem_reclaim() when it reaches a certain point? If this is the case, I don't know how to determine what is considered excessive memory usage, since we could be running on a computer with as little or as much memory as.. well, Linux can handle.

PS: I'm also considering opening the vdevs (the underlying block devices or files) with O_DIRECT, otherwise the kernel will also cache them. Does it make sense?

Thank you in advance.
Pawel Jakub Dawidek
2007-Jan-10 05:23 UTC
[zfs-code] Request for help and advice on cache behaviour
On Wed, Jan 10, 2007 at 12:39:57AM +0000, Ricardo Correia wrote:
> [...]
> So even if I have a "not very big" cache, there can be a lot of alloc'ed
> vnodes which consume a lot more memory!
> Of course, if the ARC would somehow take that memory into account when
> checking zfs_arc_max, it would be easier to tune it.

I'm sorry to say that, but I'm happy to see you have the same problem :) Maybe it will be easier to solve it. This is one of the last remaining problems I have.
I've spent a lot of time trying to solve it, but no luck.

In FreeBSD, a subsystem can register a vm_lowmem hook, which basically means "call me if there is no free memory, so I may be able to free something". When someone allocates memory with the M_WAITOK flag (the equivalent of Solaris' KM_SLEEP) and there is no free memory in the system, the allocator calls the registered vm_lowmem hooks, begging for memory.

I register the hook in my port, so I'm called when there is no free memory. Then, I immediately wake the arc_reclaim_thread thread. It almost works, but it is not reliable. Under high load it is possible that it won't free enough memory (most of the time malloc(131072) is the call that fails). I am able to reproduce it with 'iozone -a'.

> So, the better question is: is the ARC even helpful for a FUSE filesystem?
> I mean, the Linux kernel is already caching file data, even for FUSE
> filesystems.

I'm not sure if ZFS can work without the ARC, but I'd suggest not turning it off. I don't know what algorithm Linux uses for caching, but the ARC is really nice and efficient.

> PS: I'm also considering opening the vdevs (the underlying block devices or
> files) with O_DIRECT, otherwise the kernel will also cache them. Does it make
> sense?

I'd guess it makes a lot of sense; cache duplication is not a good thing.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Mark Maybee
2007-Jan-10 18:10 UTC
[zfs-code] Request for help and advice on cache behaviour
Ricardo Correia wrote:
> [...]
> So even if I have a "not very big" cache, there can be a lot of alloc'ed
> vnodes which consume a lot more memory!
> Of course, if the ARC would somehow take that memory into account when
> checking zfs_arc_max, it would be easier to tune it.

It's not just the vnodes; there are also znodes, dnodes and dbufs to consider. The bottom line is that 64 MB of vnode-related ARC data can tie up 192 MB of other memory.
We are in the process of making the ARC more aware of these extra overheads.

> So, the better question is: is the ARC even helpful for a FUSE filesystem?
> I mean, the Linux kernel is already caching file data, even for FUSE
> filesystems.
>
> Maybe I should try to disable the ARC? Or set it to a very small maximum size?
>
> Or should I have a thread that monitors memory usage and calls
> arc_kmem_reclaim() when it reaches a certain point? If this is the case, I
> don't know how to determine what is considered excessive memory usage, since
> we could be running on a computer with as little or as much memory as.. well,
> Linux can handle.

This is done on Solaris. We have a thread that monitors the free memory available on the machine and tries to keep the ARC usage in check. In your FUSE implementation you may want to minimally track the size of the vnode/znode/dnode/dbuf cache usage.

In general, this is just a hard problem. You want to keep as much data in the ARC as possible for best performance, while not impacting applications that need memory. The better integration you can get between the ARC and the VM system, the happier you will be. In the long term, Sun will likely migrate the ARC functionality completely into the VM system.

-Mark
Yuen L. Lee
2007-Jan-14 07:23 UTC
[zfs-code] Re: Request for help and advice on cache behaviour
> Ricardo Correia wrote:
> > [...]
> > Of course, if the ARC would somehow take that memory into account
> > when checking zfs_arc_max, it would be easier to tune it.
>
> It's not just the vnodes; there are also znodes, dnodes and dbufs to
> consider. The bottom line is that 64 MB of vnode-related ARC data can
> tie up 192 MB of other memory. We are in the process of making the ARC
> more aware of these extra overheads.
>
> [...]
>
> This is done on Solaris. We have a thread that monitors the free memory
> available on the machine and tries to keep the ARC usage in check. [...]
> The better integration you can get between the ARC and the VM system,
> the happier you will be. In the long term, Sun will likely migrate the
> ARC functionality completely into the VM system.

Will this make ZFS more tightly coupled with the Solaris platform? It would mean more obstacles for the engineers who are trying to make ZFS a platform-independent filesystem.

Yuen

-- 
This message posted from opensolaris.org