I have a few questions regarding ZFS, and would appreciate it if someone could enlighten me as I work my way through.

First, write cache. If I look at traditional UFS / VxFS type file systems, they normally cache metadata to RAM before flushing it to disk. This helps increase their perceived write performance (perceived in the sense that if a power outage occurs, data loss can occur). ZFS, on the other hand, performs copy-on-write to ensure that the disk is always consistent. I see this as sort of being equivalent to using a directio option. I understand that the data is written first, then the pointers are updated, but if I were to use the directio analogy, would this be correct? If that is the case, then is it true that ZFS really does not use a write cache at all? And if it does, then how is it used?

Read cache. Any of us who have started using or benchmarking ZFS have seen its voracious appetite for memory, an appetite that is fully shared with VxFS for example, so I am not singling out ZFS (I'm rather a fan). On reboot of my T2000 test server (32GB RAM) I see that the ARC cache max size is set to 30.88GB - a sizeable piece of memory. Now, is all that cache space only for read cache (given my assumption regarding write cache)?

Tunable parameters. I know that the philosophy of ZFS is that you should never have to tune your file system, but might I suggest that tuning the FS is not always a bad thing. You can't expect a FS to be all things for all people. If there are variables that can be modified to provide different performance characteristics and profiles, then I would contend that it could strengthen ZFS and lead to wider adoption and acceptance if you could, for example, limit the amount of memory used by items like the cache without messing with c_max / c_min directly in the kernel.

-Tony
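P.S. For anyone who wants to check the same numbers on their own box, here is a minimal C sketch (my own quick illustration, assuming the Solaris libkstat interface and the zfs:0:arcstats counters such as c, c_max and size; the exact counter names may differ between releases):

    /* arcstat.c - read ZFS ARC size and limits via libkstat
     * compile: cc arcstat.c -o arcstat -lkstat */
    #include <stdio.h>
    #include <kstat.h>

    static uint64_t
    arc_value(kstat_t *ksp, char *name)
    {
            kstat_named_t *kn = kstat_data_lookup(ksp, name);
            return (kn == NULL ? 0 : kn->value.ui64);
    }

    int
    main(void)
    {
            kstat_ctl_t *kc = kstat_open();
            if (kc == NULL) {
                    perror("kstat_open");
                    return (1);
            }
            /* The ARC publishes its counters as zfs:0:arcstats. */
            kstat_t *ksp = kstat_lookup(kc, "zfs", 0, "arcstats");
            if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1) {
                    fprintf(stderr, "arcstats kstat not found\n");
                    return (1);
            }
            printf("arc size  : %llu bytes\n", (u_longlong_t)arc_value(ksp, "size"));
            printf("arc c     : %llu bytes\n", (u_longlong_t)arc_value(ksp, "c"));
            printf("arc c_max : %llu bytes\n", (u_longlong_t)arc_value(ksp, "c_max"));
            (void) kstat_close(kc);
            return (0);
    }

The kstat(1M) command will show the same counters interactively.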
Let me elaborate slightly on the reason I ask these questions.

I am performing some simple benchmarking, during which a file is created by sequentially writing 64k blocks until a 100GB file is created. I am seeing, and this is exactly the same as VxFS, large pauses while the system reclaims the memory that it has consumed. I assume that since ZFS (back to the write cache question) is copy-on-write and is not write caching anything (correct me if I am wrong), it is instead using memory for my read cache. Also, since I have 32GB of memory, the reclaim periods are quite long while it frees this memory - basically rendering my volume unusable until that memory is reclaimed.

With VxFS I was able to tune the file system with write_throttle, and this allowed me to find a balance whereby the system writes crazy fast, then reclaims memory, and repeats that cycle. I guess I could modify c_max in the kernel to provide the same type of result, but this is not a supported tuning practice - and thus I do not want to do that. I am simply trying to determine where ZFS is different, where it is the same, and how I can modify its default behaviours (or if I ever will).

Also, FYI, I'm testing on Solaris 10 11/06 (all testing must be performed on production versions of Solaris), but if there are changes in Nevada that would show me different results, I would be interested in those as an aside.
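Something along these lines reproduces the behaviour (a stripped-down sketch rather than my actual benchmark; the file path and the one-second reporting threshold are just placeholders):

    /* seqwrite.c - write 64 KB blocks sequentially until the file reaches
     * 100 GB, reporting any individual write() that stalls for over a second. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>   /* gethrtime() on Solaris */

    #define BLOCKSIZE   (64 * 1024)
    #define FILESIZE    (100LL * 1024 * 1024 * 1024)

    int
    main(void)
    {
            static char buf[BLOCKSIZE];
            long long written = 0;

            memset(buf, 'A', sizeof (buf));
            /* Placeholder path for a file in the pool under test. */
            int fd = open("/testpool/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd == -1) {
                    perror("open");
                    return (1);
            }
            while (written < FILESIZE) {
                    hrtime_t start = gethrtime();
                    if (write(fd, buf, BLOCKSIZE) != BLOCKSIZE) {
                            perror("write");
                            return (1);
                    }
                    hrtime_t elapsed = gethrtime() - start;
                    if (elapsed > 1000000000LL)     /* > 1 second: a visible pause */
                            printf("pause: %.1f s at offset %lld\n",
                                elapsed / 1e9, written);
                    written += BLOCKSIZE;
            }
            (void) close(fd);
            return (0);
    }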
ZFS uses caching heavily as well; much more so, in fact, than UFS.

Copy-on-write and direct I/O are not related. As you say, data gets written first, then the metadata which points to it, but this isn't anything like direct I/O. In particular, direct I/O avoids caching the data, instead transferring it directly to/from user buffers, while ZFS-style copy-on-write caches all data. ZFS does not have direct I/O at all right now.

One key difference between UFS and ZFS is that ZFS flushes the drive's write cache at key points. (It does this rather than using ordered commands, even on SCSI disks, which to me is a little disappointing.) This guarantees that the data is on disk before the associated metadata. UFS relies on keeping the write cache disabled to ensure that its journal is written to disk before its metadata, again with the goal of keeping the file system consistent at all times.

I agree with you on tuning. It's clearly desirable that the "out-of-box" settings for a file system work well for "general purpose" loads, but there are almost always applications which require a different strategy. This is much of why UFS/QFS/VxFS added direct I/O, and it's why VxFS (which focuses heavily on databases) added Quick I/O.
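To make the distinction concrete, here is a minimal sketch (my own illustration, error handling trimmed) of what direct I/O looks like from the application side on Solaris, using the directio(3C) advisory call that UFS honours - on a file system without direct I/O support, such as ZFS today, you should expect the call to fail:

    /* dio.c - request unbuffered (direct) I/O on a file via directio(3C). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/fcntl.h>  /* directio(), DIRECTIO_ON / DIRECTIO_OFF */

    int
    main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: dio <file>\n");
                    return (1);
            }
            int fd = open(argv[1], O_RDWR);
            if (fd == -1) {
                    perror("open");
                    return (1);
            }
            /*
             * Advise the file system to bypass its cache and transfer data
             * directly to/from the user's buffers.  UFS honours this; a file
             * system with no direct I/O support returns an error here.
             */
            if (directio(fd, DIRECTIO_ON) == -1)
                    perror("directio (not supported on this file system?)");
            else
                    printf("direct I/O enabled on %s\n", argv[1]);
            (void) close(fd);
            return (0);
    }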
Tony Galway writes:

> I have a few questions regarding ZFS, and would appreciate it if
> someone could enlighten me as I work my way through.
>
> First, write cache.

We often use "write cache" to designate the cache present at the disk
level. Let's call this "disk write cache". Most file systems will cache
information in host memory; let's call this "FS cache". I think your
questions are more about FS cache behavior for different types of loads.

> If I look at traditional UFS / VxFS type file systems, they normally
> cache metadata to RAM before flushing it to disk. This helps increase
> their perceived write performance (perceived in the sense that if a
> power outage occurs, data loss can occur).

Correct, and applications can influence this behavior with O_DSYNC,
fsync, etc.

> ZFS, on the other hand, performs copy-on-write to ensure that the disk
> is always consistent. I see this as sort of being equivalent to using
> a directio option. I understand that the data is written first, then
> the pointers are updated, but if I were to use the directio analogy,
> would this be correct?

As pointed out by Anton, that's a no here. The COW ensures that ZFS is
always consistent, but it's not really related to application
consistency (that's the job of O_DSYNC, fsync, etc.). So ZFS caches
data on writes like most file systems.

> If that is the case, then is it true that ZFS really does not use a
> write cache at all? And if it does, then how is it used?

You write to cache, and every 5 seconds all the dirty data is shipped
to disk in a transaction group. On low memory we will not wait for the
5-second clock to hit before issuing a txg. The problem you and many
others face is the lack of write throttling. This is being worked on
and should, I hope, be fixed soon. The perception that ZFS is RAM
hungry will have to be reevaluated at that time. See:

    6429205 each zpool needs to monitor it's throughput and throttle
    heavy writers

> Read cache.
>
> Any of us who have started using or benchmarking ZFS have seen its
> voracious appetite for memory, an appetite that is fully shared with
> VxFS for example, so I am not singling out ZFS (I'm rather a fan). On
> reboot of my T2000 test server (32GB RAM) I see that the ARC cache max
> size is set to 30.88GB - a sizeable piece of memory.
>
> Now, is all that cache space only for read cache (given my assumption
> regarding write cache)?
>
> Tunable parameters.
>
> I know that the philosophy of ZFS is that you should never have to
> tune your file system, but might I suggest that tuning the FS is not
> always a bad thing. You can't expect a FS to be all things for all
> people. If there are variables that can be modified to provide
> different performance characteristics and profiles, then I would
> contend that it could strengthen ZFS and lead to wider adoption and
> acceptance if you could, for example, limit the amount of memory used
> by items like the cache without messing with c_max / c_min directly in
> the kernel.

Once we have write throttling, we will be better equipped to see
whether the ARC's dynamic adjustment works or not. I believe most
problems will go away and there will be less demand for such a tunable.

On to your next mail...
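P.S. A minimal sketch (my own example; the file names are placeholders) of the two application-level knobs mentioned above, O_DSYNC and fsync, which behave the same way on ZFS as on any POSIX file system:

    /* dsync.c - two ways an application controls data durability. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char msg[] = "committed record\n";

            /* 1. O_DSYNC: write() returns only once the data is on stable storage. */
            int fd = open("/tmp/journal.dsync", O_WRONLY | O_CREAT | O_DSYNC, 0644);
            if (fd == -1 || write(fd, msg, sizeof (msg) - 1) == -1) {
                    perror("O_DSYNC write");
                    return (1);
            }
            (void) close(fd);

            /* 2. fsync(): write through the FS cache, then flush explicitly. */
            fd = open("/tmp/journal.fsync", O_WRONLY | O_CREAT, 0644);
            if (fd == -1 || write(fd, msg, sizeof (msg) - 1) == -1) {
                    perror("buffered write");
                    return (1);
            }
            if (fsync(fd) == -1) {  /* data and metadata forced to disk */
                    perror("fsync");
                    return (1);
            }
            (void) close(fd);
            return (0);
    }

On ZFS these synchronous requests are what the intent log (ZIL) services; everything else sits in the FS cache until the next transaction group sync.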
Tony Galway writes:

> Let me elaborate slightly on the reason I ask these questions.
>
> I am performing some simple benchmarking, during which a file is
> created by sequentially writing 64k blocks until a 100GB file is
> created. I am seeing, and this is exactly the same as VxFS, large
> pauses while the system reclaims the memory that it has consumed.
>
> I assume that since ZFS (back to the write cache question) is
> copy-on-write and is not write caching anything (correct me if I am
> wrong), it is instead using memory for my read cache. Also, since I
> have 32GB of memory, the reclaim periods are quite long while it frees
> this memory - basically rendering my volume unusable until that memory
> is reclaimed.
>
> With VxFS I was able to tune the file system with write_throttle, and
> this allowed me to find a balance whereby the system writes crazy
> fast, then reclaims memory, and repeats that cycle.
>
> I guess I could modify c_max in the kernel to provide the same type of
> result, but this is not a supported tuning practice - and thus I do
> not want to do that.
>
> I am simply trying to determine where ZFS is different, where it is
> the same, and how I can modify its default behaviours (or if I ever
> will).
>
> Also, FYI, I'm testing on Solaris 10 11/06 (all testing must be
> performed on production versions of Solaris), but if there are changes
> in Nevada that would show me different results, I would be interested
> in those as an aside.

Today, a txg sync can take a very long time for this type of workload.
A first goal of write throttling will be to at least bound the sync
times. The amount of dirty memory (not quickly reclaimable) will then
be limited, and the ARC should be much better at adjusting itself. A
second goal will be to keep sync times close to 5 seconds, further
limiting the RAM consumption.
Anton & Roch,

Thank you for helping me understand this. I didn't want to make too many assumptions that were unfounded and then incorrectly relay that information back to clients. So if I might just repeat your statements, so my slow mind is sure it understands - and Roch, yes, your assumption is correct that I am referencing file system cache, not disk cache:

A. Copy-on-write exists solely to ensure on-disk data integrity, and as Anton pointed out it is completely different than DirectIO.

B. ZFS still avails itself of a file system cache, and therefore it is possible that data can be lost if it hasn't been written to disk and the server fails.

C. The write throttling issue is known and being looked at - when it will be fixed, we don't know? I'll add myself to the notification list as an interested party :)

Now to another question related to Anton's post. You mention that DirectIO does not exist in ZFS at this point. Are there plans to support DirectIO, any functionality that will simulate DirectIO, or some other non-caching ability suitable for critical systems such as databases, if the client still wanted to deploy on file systems?
On Apr 20, 2007, at 10:47 AM, Anton B. Rang wrote:

> ZFS uses caching heavily as well; much more so, in fact, than UFS.
>
> Copy-on-write and direct I/O are not related. As you say, data gets
> written first, then the metadata which points to it, but this isn't
> anything like direct I/O. In particular, direct I/O avoids caching
> the data, instead transferring it directly to/from user buffers,
> while ZFS-style copy-on-write caches all data. ZFS does not have
> direct I/O at all right now.

Your context is correct, but I'd be careful with "direct I/O", as I think it's an overloaded term: most people don't understand what it does - just that it got them good performance (somehow). Roch has a blog on this:

    http://blogs.sun.com/roch/entry/zfs_and_directio

But you are correct that ZFS does not have the ability for the user to say "don't cache user data for this filesystem" (which is one part of direct I/O). I've talked to some database people and they aren't convinced having this feature would be a win. So if someone has a real world workload where having the ability to purposely not cache user data would be a win, please let me know.

eric
Tony Galway writes:

> Anton & Roch,
>
> Thank you for helping me understand this. I didn't want to make too
> many assumptions that were unfounded and then incorrectly relay that
> information back to clients.
>
> So if I might just repeat your statements, so my slow mind is sure it
> understands - and Roch, yes, your assumption is correct that I am
> referencing file system cache, not disk cache:
>
> A. Copy-on-write exists solely to ensure on-disk data integrity, and
> as Anton pointed out it is completely different than DirectIO.

I would say 'ensure pool integrity', but you get the idea.

> B. ZFS still avails itself of a file system cache, and therefore it is
> possible that data can be lost if it hasn't been written to disk and
> the server fails.

Yep.

> C. The write throttling issue is known and being looked at - when it
> will be fixed, we don't know? I'll add myself to the notification list
> as an interested party :)

Yep.

> Now to another question related to Anton's post. You mention that
> DirectIO does not exist in ZFS at this point. Are there plans to
> support DirectIO, any functionality that will simulate DirectIO, or
> some other non-caching ability suitable for critical systems such as
> databases, if the client still wanted to deploy on file systems?

Here Anton and I disagree. I believe that the ZFS design would not gain much performance from something we'd call directio. See:

    http://blogs.sun.com/roch/entry/zfs_and_directio

-r
> So if someone has a real world workload where having the ability to
> purposely not cache user data would be a win, please let me know.

Multimedia streaming is an obvious one.

For databases, it depends on the application, but in general the database will do a better job of selecting which data to keep in memory than the file system can. (Of course, some low-end databases rely on the file system for this.)
On 2007-Apr-20, at 21:00 UTC, johansen-osdev at sun.com wrote:
Tony:

> Now to another question related to Anton's post. You mention that
> DirectIO does not exist in ZFS at this point. Are there plans to
> support DirectIO, any functionality that will simulate DirectIO, or
> some other non-caching ability suitable for critical systems such as
> databases, if the client still wanted to deploy on file systems?

I would describe DirectIO as the ability to map the application's buffers directly for disk DMAs. You need to disable the filesystem's cache to do this correctly. Having the cache disabled is an implementation requirement for this feature.

Based upon this definition, are you seeking the ability to disable the filesystem's cache or the ability to directly map application buffers for DMA?

-j
On Apr 20, 2007, at 1:02 PM, Anton B. Rang wrote:

>> So if someone has a real world workload where having the ability to
>> purposely not cache user data would be a win, please let me know.
>
> Multimedia streaming is an obvious one.

Assuming a single reader? Or multiple readers at the same spot?

> For databases, it depends on the application, but in general the
> database will do a better job of selecting which data to keep in
> memory than the file system can. (Of course, some low-end databases
> rely on the file system for this.)

Yeah, but from what I've heard, the database will start up asking for a chunk of memory (and get it). The ARC will then react according to what memory is left and not grow so big as to cause the database to "shrink" its memory hold.

eric