I have a few questions regarding ZFS, and would appreciate it if someone could enlighten me as I work my way through.

First, write cache. If I look at traditional UFS / VxFS type file systems, they normally cache metadata to RAM before flushing it to disk. This helps increase their perceived write performance (perceived in the sense that if a power outage occurs, data loss can occur). ZFS, on the other hand, performs copy-on-write to ensure that the disk is always consistent. I see this as sort of being equivalent to using a directio option. I understand that the data is written first, then the pointers are updated, but if I were to use the directio analogy, would this be correct? If that is the case, then is it true that ZFS really does not use a write cache at all? And if it does, then how is it used?

Read cache. Any of us who have started using or benchmarking ZFS have seen its voracious appetite for memory, an appetite that is fully shared with VxFS for example, so I am not singling out ZFS (I'm rather a fan). On reboot of my T2000 test server (32GB RAM) I see that the ARC cache max size is set to 30.88GB - a sizeable piece of memory. Now, is all that cache space only for read cache (given my assumption regarding write cache)?

Tunable parameters. I know that the philosophy of ZFS is that you should never have to tune your file system, but might I suggest that tuning the FS is not always a bad thing. You can't expect a FS to be all things for all people. If there are variables that can be modified to provide different performance characteristics and profiles, then I would contend that it could strengthen ZFS and lead to wider adoption and acceptance if you could, for example, limit the amount of memory used by items like the cache without messing with c_max / c_min directly in the kernel.

-Tony
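P.S. For anyone who wants to check the same numbers on their own box, here is a minimal C sketch (my own quick illustration, assuming the Solaris libkstat interface and the zfs:0:arcstats counters such as c, c_max and size; the exact counter names may differ between releases):

    /* arcstat.c - read ZFS ARC size and limits via libkstat
     * compile: cc arcstat.c -o arcstat -lkstat */
    #include <stdio.h>
    #include <kstat.h>

    static uint64_t
    arc_value(kstat_t *ksp, char *name)
    {
            kstat_named_t *kn = kstat_data_lookup(ksp, name);
            return (kn == NULL ? 0 : kn->value.ui64);
    }

    int
    main(void)
    {
            kstat_ctl_t *kc = kstat_open();
            if (kc == NULL) {
                    perror("kstat_open");
                    return (1);
            }
            /* The ARC publishes its counters as zfs:0:arcstats. */
            kstat_t *ksp = kstat_lookup(kc, "zfs", 0, "arcstats");
            if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1) {
                    fprintf(stderr, "arcstats kstat not found\n");
                    return (1);
            }
            printf("arc size  : %llu bytes\n", (u_longlong_t)arc_value(ksp, "size"));
            printf("arc c     : %llu bytes\n", (u_longlong_t)arc_value(ksp, "c"));
            printf("arc c_max : %llu bytes\n", (u_longlong_t)arc_value(ksp, "c_max"));
            (void) kstat_close(kc);
            return (0);
    }

The kstat(1M) command will show the same counters interactively.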
Let me elaborate slightly on the reason I ask these questions.

I am performing some simple benchmarking, during which a file is created by sequentially writing 64k blocks until a 100GB file is created. I am seeing, and this is exactly the same as VxFS, large pauses while the system reclaims the memory that it has consumed. I assume that since ZFS (back to the write cache question) is copy-on-write and is not write caching anything (correct me if I am wrong), it is instead using memory for my read cache. Also, since I have 32GB of memory, the reclaim periods are quite long while it frees this memory - basically rendering my volume unusable until that memory is reclaimed.

With VxFS I was able to tune the file system with write_throttle, and this allowed me to find a balance whereby the system writes crazy fast, then reclaims memory, and repeats that cycle. I guess I could modify c_max in the kernel to provide the same type of result, but this is not a supported tuning practice - and thus I do not want to do that. I am simply trying to determine where ZFS is different, where it is the same, and how I can modify its default behaviours (or if I ever will).

Also, FYI, I'm testing on Solaris 10 11/06 (all testing must be performed on production versions of Solaris), but if there are changes in Nevada that would show me different results, I would be interested in those as an aside.
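Something along these lines reproduces the behaviour (a stripped-down sketch rather than my actual benchmark; the file path and the one-second reporting threshold are just placeholders):

    /* seqwrite.c - write 64 KB blocks sequentially until the file reaches
     * 100 GB, reporting any individual write() that stalls for over a second. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>   /* gethrtime() on Solaris */

    #define BLOCKSIZE   (64 * 1024)
    #define FILESIZE    (100LL * 1024 * 1024 * 1024)

    int
    main(void)
    {
            static char buf[BLOCKSIZE];
            long long written = 0;

            memset(buf, 'A', sizeof (buf));
            /* Placeholder path for a file in the pool under test. */
            int fd = open("/testpool/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd == -1) {
                    perror("open");
                    return (1);
            }
            while (written < FILESIZE) {
                    hrtime_t start = gethrtime();
                    if (write(fd, buf, BLOCKSIZE) != BLOCKSIZE) {
                            perror("write");
                            return (1);
                    }
                    hrtime_t elapsed = gethrtime() - start;
                    if (elapsed > 1000000000LL)     /* > 1 second: a visible pause */
                            printf("pause: %.1f s at offset %lld\n",
                                elapsed / 1e9, written);
                    written += BLOCKSIZE;
            }
            (void) close(fd);
            return (0);
    }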
ZFS uses caching heavily as well; much more so, in fact, than UFS.

Copy-on-write and direct I/O are not related. As you say, data gets written first, then the metadata which points to it, but this isn't anything like direct I/O. In particular, direct I/O avoids caching the data, instead transferring it directly to/from user buffers, while ZFS-style copy-on-write caches all data. ZFS does not have direct I/O at all right now.

One key difference between UFS and ZFS is that ZFS flushes the drive's write cache at key points. (It does this rather than using ordered commands, even on SCSI disks, which to me is a little disappointing.) This guarantees that the data is on disk before the associated metadata. UFS relies on keeping the write cache disabled to ensure that its journal is written to disk before its metadata, again with the goal of keeping the file system consistent at all times.

I agree with you on tuning. It's clearly desirable that the "out-of-box" settings for a file system work well for "general purpose" loads, but there are almost always applications which require a different strategy. This is much of why UFS/QFS/VxFS added direct I/O, and it's why VxFS (which focuses heavily on databases) added Quick I/O.
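To make the distinction concrete, here is a minimal sketch (my own illustration, error handling trimmed) of what direct I/O looks like from the application side on Solaris, using the directio(3C) advisory call that UFS honours - on a file system without direct I/O support, such as ZFS today, you should expect the call to fail:

    /* dio.c - request unbuffered (direct) I/O on a file via directio(3C). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/fcntl.h>  /* directio(), DIRECTIO_ON / DIRECTIO_OFF */

    int
    main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: dio <file>\n");
                    return (1);
            }
            int fd = open(argv[1], O_RDWR);
            if (fd == -1) {
                    perror("open");
                    return (1);
            }
            /*
             * Advise the file system to bypass its cache and transfer data
             * directly to/from the user's buffers.  UFS honours this; a file
             * system with no direct I/O support returns an error here.
             */
            if (directio(fd, DIRECTIO_ON) == -1)
                    perror("directio (not supported on this file system?)");
            else
                    printf("direct I/O enabled on %s\n", argv[1]);
            (void) close(fd);
            return (0);
    }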
Tony Galway writes:

> I have a few questions regarding ZFS, and would appreciate it if
> someone could enlighten me as I work my way through.
>
> First, write cache.

We often use "write cache" to designate the cache present at the disk
level. Let's call this "disk write cache". Most file systems will cache
information in host memory; let's call this "FS cache". I think your
questions are more about FS cache behavior for different types of loads.

> If I look at traditional UFS / VxFS type file systems, they normally
> cache metadata to RAM before flushing it to disk. This helps increase
> their perceived write performance (perceived in the sense that if a
> power outage occurs, data loss can occur).

Correct, and applications can influence this behavior with O_DSYNC,
fsync, etc.

> ZFS, on the other hand, performs copy-on-write to ensure that the disk
> is always consistent. I see this as sort of being equivalent to using
> a directio option. I understand that the data is written first, then
> the pointers are updated, but if I were to use the directio analogy,
> would this be correct?

As pointed out by Anton, that's a no here. The COW ensures that ZFS is
always consistent, but it's not really related to application
consistency (that's the job of O_DSYNC, fsync, etc.). So ZFS caches
data on writes like most file systems.

> If that is the case, then is it true that ZFS really does not use a
> write cache at all? And if it does, then how is it used?

You write to cache, and every 5 seconds all the dirty data is shipped
to disk in a transaction group. On low memory we will not wait for the
5-second clock to hit before issuing a txg. The problem you and many
others face is the lack of write throttling. This is being worked on
and should, I hope, be fixed soon. The perception that ZFS is RAM
hungry will have to be reevaluated at that time. See:

    6429205 each zpool needs to monitor it's throughput and throttle
    heavy writers

> Read cache.
>
> Any of us who have started using or benchmarking ZFS have seen its
> voracious appetite for memory, an appetite that is fully shared with
> VxFS for example, so I am not singling out ZFS (I'm rather a fan). On
> reboot of my T2000 test server (32GB RAM) I see that the ARC cache max
> size is set to 30.88GB - a sizeable piece of memory.
>
> Now, is all that cache space only for read cache (given my assumption
> regarding write cache)?
>
> Tunable parameters.
>
> I know that the philosophy of ZFS is that you should never have to
> tune your file system, but might I suggest that tuning the FS is not
> always a bad thing. You can't expect a FS to be all things for all
> people. If there are variables that can be modified to provide
> different performance characteristics and profiles, then I would
> contend that it could strengthen ZFS and lead to wider adoption and
> acceptance if you could, for example, limit the amount of memory used
> by items like the cache without messing with c_max / c_min directly in
> the kernel.

Once we have write throttling, we will be better equipped to see
whether the ARC's dynamic adjustment works or not. I believe most
problems will go away and there will be less demand for such a tunable.

On to your next mail...
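P.S. A minimal sketch (my own example; the file names are placeholders) of the two application-level knobs mentioned above, O_DSYNC and fsync, which behave the same way on ZFS as on any POSIX file system:

    /* dsync.c - two ways an application controls data durability. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char msg[] = "committed record\n";

            /* 1. O_DSYNC: write() returns only once the data is on stable storage. */
            int fd = open("/tmp/journal.dsync", O_WRONLY | O_CREAT | O_DSYNC, 0644);
            if (fd == -1 || write(fd, msg, sizeof (msg) - 1) == -1) {
                    perror("O_DSYNC write");
                    return (1);
            }
            (void) close(fd);

            /* 2. fsync(): write through the FS cache, then flush explicitly. */
            fd = open("/tmp/journal.fsync", O_WRONLY | O_CREAT, 0644);
            if (fd == -1 || write(fd, msg, sizeof (msg) - 1) == -1) {
                    perror("buffered write");
                    return (1);
            }
            if (fsync(fd) == -1) {  /* data and metadata forced to disk */
                    perror("fsync");
                    return (1);
            }
            (void) close(fd);
            return (0);
    }

On ZFS these synchronous requests are what the intent log (ZIL) services; everything else sits in the FS cache until the next transaction group sync.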
Tony Galway writes:

> Let me elaborate slightly on the reason I ask these questions.
>
> I am performing some simple benchmarking, during which a file is
> created by sequentially writing 64k blocks until a 100GB file is
> created. I am seeing, and this is exactly the same as VxFS, large
> pauses while the system reclaims the memory that it has consumed.
>
> I assume that since ZFS (back to the write cache question) is
> copy-on-write and is not write caching anything (correct me if I am
> wrong), it is instead using memory for my read cache. Also, since I
> have 32GB of memory, the reclaim periods are quite long while it frees
> this memory - basically rendering my volume unusable until that memory
> is reclaimed.
>
> With VxFS I was able to tune the file system with write_throttle, and
> this allowed me to find a balance whereby the system writes crazy
> fast, then reclaims memory, and repeats that cycle.
>
> I guess I could modify c_max in the kernel to provide the same type of
> result, but this is not a supported tuning practice - and thus I do
> not want to do that.
>
> I am simply trying to determine where ZFS is different, where it is
> the same, and how I can modify its default behaviours (or if I ever
> will).
>
> Also, FYI, I'm testing on Solaris 10 11/06 (all testing must be
> performed on production versions of Solaris), but if there are changes
> in Nevada that would show me different results, I would be interested
> in those as an aside.

Today, a txg sync can take a very long time for this type of workload.
A first goal of write throttling will be to at least bound the sync
times. The amount of dirty memory (not quickly reclaimable) will then
be limited, and the ARC should be much better at adjusting itself. A
second goal will be to keep sync times close to 5 seconds, further
limiting the RAM consumption.
Anton & Roch,

Thank you for helping me understand this. I didn't want to make too many assumptions that were unfounded and then incorrectly relay that information back to clients. So if I might just repeat your statements, so my slow mind is sure it understands - and Roch, yes, your assumption is correct that I am referencing file system cache, not disk cache:

A. Copy-on-write exists solely to ensure on-disk data integrity, and as Anton pointed out it is completely different than DirectIO.

B. ZFS still avails itself of a file system cache, and therefore it is possible that data can be lost if it hasn't been written to disk and the server fails.

C. The write throttling issue is known and being looked at - when it will be fixed, we don't know? I'll add myself to the notification list as an interested party :)

Now to another question related to Anton's post. You mention that DirectIO does not exist in ZFS at this point. Are there plans to support DirectIO, any functionality that will simulate DirectIO, or some other non-caching ability suitable for critical systems such as databases, if the client still wanted to deploy on file systems?
On Apr 20, 2007, at 10:47 AM, Anton B. Rang wrote:

> ZFS uses caching heavily as well; much more so, in fact, than UFS.
>
> Copy-on-write and direct I/O are not related. As you say, data gets
> written first, then the metadata which points to it, but this isn't
> anything like direct I/O. In particular, direct I/O avoids caching
> the data, instead transferring it directly to/from user buffers,
> while ZFS-style copy-on-write caches all data. ZFS does not have
> direct I/O at all right now.

Your context is correct, but I'd be careful with "direct I/O", as I think it's an overloaded term: most people don't understand what it does - just that it got them good performance (somehow). Roch has a blog on this:

    http://blogs.sun.com/roch/entry/zfs_and_directio

But you are correct that ZFS does not have the ability for the user to say "don't cache user data for this filesystem" (which is one part of direct I/O). I've talked to some database people and they aren't convinced having this feature would be a win. So if someone has a real world workload where having the ability to purposely not cache user data would be a win, please let me know.

eric
Tony Galway writes:

> Anton & Roch,
>
> Thank you for helping me understand this. I didn't want to make too
> many assumptions that were unfounded and then incorrectly relay that
> information back to clients.
>
> So if I might just repeat your statements, so my slow mind is sure it
> understands - and Roch, yes, your assumption is correct that I am
> referencing file system cache, not disk cache:
>
> A. Copy-on-write exists solely to ensure on-disk data integrity, and
> as Anton pointed out it is completely different than DirectIO.

I would say 'ensure pool integrity', but you get the idea.

> B. ZFS still avails itself of a file system cache, and therefore it is
> possible that data can be lost if it hasn't been written to disk and
> the server fails.

Yep.

> C. The write throttling issue is known and being looked at - when it
> will be fixed, we don't know? I'll add myself to the notification list
> as an interested party :)

Yep.

> Now to another question related to Anton's post. You mention that
> DirectIO does not exist in ZFS at this point. Are there plans to
> support DirectIO, any functionality that will simulate DirectIO, or
> some other non-caching ability suitable for critical systems such as
> databases, if the client still wanted to deploy on file systems?

Here Anton and I disagree. I believe that the ZFS design would not gain much performance from something we'd call directio. See:

    http://blogs.sun.com/roch/entry/zfs_and_directio

-r
> So if someone has a real world workload where having the ability to
> purposely not cache user data would be a win, please let me know.

Multimedia streaming is an obvious one.

For databases, it depends on the application, but in general the database will do a better job of selecting which data to keep in memory than the file system can. (Of course, some low-end databases rely on the file system for this.)
On 2007-Apr-20, at 21:00 UTC, johansen-osdev at sun.com wrote:
Tony:

> Now to another question related to Anton's post. You mention that
> DirectIO does not exist in ZFS at this point. Are there plans to
> support DirectIO, any functionality that will simulate DirectIO, or
> some other non-caching ability suitable for critical systems such as
> databases, if the client still wanted to deploy on file systems?

I would describe DirectIO as the ability to map the application's buffers directly for disk DMAs. You need to disable the filesystem's cache to do this correctly. Having the cache disabled is an implementation requirement for this feature.

Based upon this definition, are you seeking the ability to disable the filesystem's cache or the ability to directly map application buffers for DMA?

-j
On Apr 20, 2007, at 1:02 PM, Anton B. Rang wrote:

>> So if someone has a real world workload where having the ability to
>> purposely not cache user data would be a win, please let me know.
>
> Multimedia streaming is an obvious one.

Assuming a single reader? Or multiple readers at the same spot?

> For databases, it depends on the application, but in general the
> database will do a better job of selecting which data to keep in
> memory than the file system can. (Of course, some low-end databases
> rely on the file system for this.)

Yeah, but from what I've heard, the database will start up asking for a chunk of memory (and get it). The ARC will then react according to what memory is left and not grow so big as to cause the database to "shrink" its memory hold.

eric