I wonder if it is possible (currently or in the future as an RFE) to tell ZFS to automatically read ahead some files and cache them in RAM and/or L2ARC?

One use-case would be for home-NAS setups where multimedia (video files or catalogs of images/music) are viewed from a ZFS box. For example, if a user wants to watch a film, listen to a playlist of MP3s, or push photos to a wall display (photo frame, etc.), the storage box "should" read ahead all required data from the HDDs and save it in ARC/L2ARC. Then the HDDs can spin down for hours while the pre-fetched gigabytes of data are served to consumers from the cache. End-users get peace, quiet and less electricity used while they enjoy their multimedia entertainment ;)

Is it possible? If not, how hard would it be to implement?

In terms of scripting, would it suffice to detect reads (i.e. with DTrace) and read the files to /dev/null to get them cached along with all required metadata (so that the mechanical HDDs are not required for reads afterwards)?

Thanks,
//Jim Klimov
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I wonder if it is possible (currently or in the future as an RFE)
> to tell ZFS to automatically read-ahead some files and cache them
> in RAM and/or L2ARC?
>
> One use-case would be for Home-NAS setups where multimedia (video
> files or catalogs of images/music) are viewed from a ZFS box. For
> example, if a user wants to watch a film, or listen to a playlist
> of MP3s, or push photos to a wall display (photo frame, etc.),
> the storage box "should" read-ahead all required data from HDDs
> and save it in ARC/L2ARC. Then the HDDs can spin down for hours
> while the pre-fetched gigabytes of data are used by consumers
> from the cache. End-users get peace, quiet and less electricity
> used while they enjoy their multimedia entertainment ;)

This whole subject is important and useful - and not unique to ZFS. The whole question is: how can the system predict which things are going to be requested next?

In the case of a video, there's a big file which is likely to be read sequentially. I don't know how far readahead currently reads ahead, but it is surely only smart enough to stay within a single file. If the readahead buffer starts to get low and the disks have been spun down, I don't know how low the buffer gets before it triggers more readahead. But at least in the case of streaming video files, there's a very realistic possibility that something like the existing readahead can do what you want.

In the case of your MP3 collection... Probably the only thing you can do is to write a script which will simply go read all the files you predict will be read soon. The key here is the prediction - there's no way ZFS or Solaris, or any other OS in the present day, is going to intelligently predict which files you'll be requesting soon. But you, the user, who knows your usage patterns, might be able to make these predictions and request to cache them. The request is simply telling the system to start reading those files now. So it's very easy to cache, as long as you know what to cache.
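A minimal sketch of such a manual pre-read script, for illustration only; the mount point and file extensions are assumptions, not part of any existing tool:

    #!/bin/sh
    # Pre-read selected media files so their data and metadata land in the ARC.
    # Usage: preread.sh [directory]   (the default /tank/media is only an example)
    MEDIA_DIR="${1:-/tank/media}"
    find "$MEDIA_DIR" -type f \( -name '*.avi' -o -name '*.mkv' -o -name '*.mp3' \) \
        -exec cat {} + > /dev/null
    # Alternatively, pre-read one whole directory (e.g. an album) before playback:
    #   tar cf - "$MEDIA_DIR/SomeAlbum" > /dev/null

Whether the cached blocks then survive until playback still depends on ARC eviction and memory pressure, which is exactly what the rest of this thread discusses.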
On 01/08/12 09:30, Edward Ned Harvey wrote:

> In the case of your MP3 collection... Probably the only thing you can do is
> to write a script which will simply go read all the files you predict will
> be read soon. The key here is the prediction - there's no way ZFS or
> Solaris, or any other OS in the present day, is going to intelligently
> predict which files you'll be requesting soon.

The other prediction is whether the blocks will be reused. If the blocks of a streaming read are only used once, then it may be wasteful for a file system to allow these blocks to be placed in the cache. If a file system purposely chooses not to cache streaming reads, manually scheduling a "pre-read" of particular files may simply cause the file to be read from disk twice: once on the manual pre-read and again when it is read by the actual application.

I believe Joerg Moellenkamp published a discussion several years ago on how the L1 ARC attempts to deal with the pollution of the cache by large streaming reads, but I don't have a bookmark handy (nor the knowledge of whether the behavior is still accurate).
2012-01-08 19:15, John Martin wrote:

> On 01/08/12 09:30, Edward Ned Harvey wrote:
>
>> In the case of your MP3 collection... Probably the only thing you can
>> do is to write a script which will simply go read all the files you
>> predict will be read soon. The key here is the prediction - there's no
>> way ZFS or Solaris, or any other OS in the present day, is going to
>> intelligently predict which files you'll be requesting soon.
>
> The other prediction is whether the blocks will be reused.
> If the blocks of a streaming read are only used once, then
> it may be wasteful for a file system to allow these blocks
> to be placed in the cache. If a file system purposely
> chooses not to cache streaming reads, manually scheduling a
> "pre-read" of particular files may simply cause the file to be read
> from disk twice: once on the manual pre-read and again when it is
> read by the actual application.
>
> I believe Joerg Moellenkamp published a discussion
> several years ago on how the L1 ARC attempts to deal with the pollution
> of the cache by large streaming reads, but I don't have
> a bookmark handy (nor the knowledge of whether the
> behavior is still accurate).

Well, this point is valid for intensively used servers - but there such blocks might just get evicted from the caches by newer and/or more frequently used blocks. However, for smaller servers, such as home NASes which have about one user overall, pre-reading and caching files even for a single use might be an objective per se - just to let the hard disks spin down. Say, if I sit down to watch a movie from my NAS, it is likely that for 90 or 120 minutes there will be no other IO initiated by me. The movie file can be pre-read in a few seconds, and then most of the storage system can go to sleep.

//Jim
On 01/08/12 11:30, Jim Klimov wrote:

> However for smaller servers, such as home NASes which have
> about one user overall, pre-reading and caching files even
> for a single use might be an objective per se - just to let
> the hard-disks spin down. Say, if I sit down to watch a
> movie from my NAS, it is likely that for 90 or 120 minutes
> there will be no other IO initiated by me. The movie file
> can be pre-read in a few seconds, and then most of the
> storage system can go to sleep.

Isn't this just a more extreme case of prediction? In addition to the file system knowing there will only be one client reading 90-120 minutes of (HD?) video that will fit in the memory of a small(er) server, now the hard drive power management code also knows there won't be another access for 90-120 minutes, so it is OK to spin down the hard drive(s).
2012-01-09 0:29, John Martin wrote:

> On 01/08/12 11:30, Jim Klimov wrote:
>
>> However for smaller servers, such as home NASes which have
>> about one user overall, pre-reading and caching files even
>> for a single use might be an objective per se - just to let
>> the hard-disks spin down. Say, if I sit down to watch a
>> movie from my NAS, it is likely that for 90 or 120 minutes
>> there will be no other IO initiated by me. The movie file
>> can be pre-read in a few seconds, and then most of the
>> storage system can go to sleep.

I don't find such home-NAS usage uncommon, because I am my own example user - so I see this pattern often ;)

> Isn't this just a more extreme case of prediction?

Probably it is, and this is probably not a task for ZFS alone, but for logic outside it. There are some requirements that ZFS should meet in order for this to work, though. Details follow...

> In addition to the file system knowing there will only
> be one client reading 90-120 minutes of (HD?) video
> that will fit in the memory of a small(er) server,
> now the hard drive power management code also knows there
> won't be another access for 90-120 minutes so it is OK
> to spin down the hard drive(s).

Well, in the original post I did suggest that the prediction logic might go into scripting or some other user-level tool. And it should, really, to keep the kernel clean and slim. The "predictor" might be as simple as a DTrace file-access monitor which would "cat" or "tar" files into /dev/null. I.e., if it detected access to "*.(avi|mkv|wmv)", it would cat the file; if it detected "*.(mp3|ogg|jpg)", it would tar the parent directory. Might be dumb and still sufficiently efficient ;)

However, for such use-cases this tool would need some "guarantees" from ZFS. One would be that the read-ahead data will find its way into the caches and won't be evicted for no reason (when there is no other RAM pressure). This means that the tool should be able to read all the data and metadata required by ZFS, so that no more disk access is required once it is all in cache. It might require a tunable in ZFS for home-NAS users which would disable the current "no-caching" of detected streaming reads: we need the opposite of that behavior.

Another part is HDD power management, which reportedly works in Solaris, allowing disks to spin down when there has been no access for some time. Probably there is a syscall to do this on demand as well...

On a side note, for home-NASes or other not-heavily-used storage servers, it would be wonderful to be able to cache small writes onto ZIL devices, if present, and not flush them onto the main pool until some megabyte limit is reached (i.e. the ZIL is full), or a pool export/import event occurs. This would allow the main disk arrays to remain idle for a long time while small sporadic writes initiated by the OS (logs, atimes, web-browser cache files, whatever) are persistently stored in the ZIL. Essentially, this would be like setting TXG-commit times to practical infinity, and actually committing based on byte-count limits. One possible difference would be not streaming larger writes to the pool disks at once, but also storing them in the dedicated ZIL.
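A minimal sketch of such a "predictor" watcher, assuming DTrace is available and that the media lives under a hypothetical /tank/media; the probe, patterns and actions are only illustrative and not an existing tool:

    #!/bin/sh
    # Hypothetical watcher: pre-read media files (or their folders) on first access.
    MEDIA=/tank/media     # assumption: where the media dataset is mounted
    dtrace -q -n 'syscall::open*:entry { printf("%s\n", copyinstr(arg0)); }' 2>/dev/null |
    while read f; do
        case "$f" in
            "$MEDIA"/*.avi|"$MEDIA"/*.mkv|"$MEDIA"/*.wmv)
                cat "$f" > /dev/null &              # pull the whole video into ARC
                ;;
            "$MEDIA"/*.mp3|"$MEDIA"/*.ogg|"$MEDIA"/*.jpg)
                d=`dirname "$f"`
                tar cf - "$d" > /dev/null &         # pre-read the whole album/folder
                ;;
        esac
    done

Note that copyinstr() on open(2) arguments can occasionally miss paths that are not yet faulted in, and the pre-read only helps if ZFS actually keeps those blocks cached - the "guarantee" discussed above.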
On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:

> I wonder if it is possible (currently or in the future as an RFE)
> to tell ZFS to automatically read-ahead some files and cache them
> in RAM and/or L2ARC?

See the discussions of the ZFS intelligent prefetch algorithm. I think Ben Rockwood's description is the best general description:
http://www.cuddletech.com/blog/pivot/entry.php?id=1040

And a more engineer-focused description is at:
http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
 -- richard

> One use-case would be for Home-NAS setups where multimedia (video
> files or catalogs of images/music) are viewed from a ZFS box. For
> example, if a user wants to watch a film, or listen to a playlist
> of MP3s, or push photos to a wall display (photo frame, etc.),
> the storage box "should" read-ahead all required data from HDDs
> and save it in ARC/L2ARC. Then the HDDs can spin down for hours
> while the pre-fetched gigabytes of data are used by consumers
> from the cache. End-users get peace, quiet and less electricity
> used while they enjoy their multimedia entertainment ;)
>
> Is it possible? If not, how hard would it be to implement?
>
> In terms of scripting, would it suffice to detect reads (i.e.
> with DTrace) and read the files to /dev/null to get them cached
> along with all required metadata (so that mechanical HDDs are
> not required for reads afterwards)?
>
> Thanks,
> //Jim Klimov
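For readers who want to see whether the file-level intelligent prefetcher is actually doing anything on their box, the DMU prefetch statistics are exposed as a kstat on OpenSolaris/illumos-era builds (exact field names may vary between builds, so treat this as a sketch):

    # dump the intelligent (file-level) prefetch counters
    kstat -p zfs:0:zfetchstats

    # or sample the hit counter every 5 seconds while streaming a large file
    kstat -p zfs:0:zfetchstats:hits 5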
2012-01-09 4:14, Richard Elling wrote:

> On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:
>
>> I wonder if it is possible (currently or in the future as an RFE)
>> to tell ZFS to automatically read-ahead some files and cache them
>> in RAM and/or L2ARC?
>
> See discussions on the ZFS intelligent prefetch algorithm. I think Ben Rockwood's
> description is the best general description:
> http://www.cuddletech.com/blog/pivot/entry.php?id=1040
>
> And a more engineer-focused description is at:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
> -- richard

Thanks for the pointers. While I've seen those articles (in fact, one of the two non-spam comments in Ben's blog was mine), rehashing the basics is always useful ;)

Still, how does VDEV prefetch play along with file-level prefetch? For example, if ZFS prefetched 64K from disk at the SPA level, and those sectors luckily happen to contain the "next" blocks of a streaming-read file, would the file-level prefetch take the data from the RAM cache or still request them from the disk?

In what cases would it make sense to increase zfs_vdev_cache_size? Does it apply to all disks combined, or to each disk (or even slice/partition) separately?

In fact, this reading got me thinking that I might have a fundamental misunderstanding lately; hence a couple of new yes-no questions arose:

Is it true or false that: ZFS might skip the cache and go to disks for "streaming" reads? (The more I think about it, the more senseless this sentence seems, and I might have just confused it with ZIL writes of bulk data.)

Is it true or false that: ARC might evict cached blocks based on age (without new reads or other processes requiring the RAM space)?

And I guess the generic answer to my original question regarding intelligent pre-fetching of whole files is that this should be done by scripts outside ZFS itself, and that the read-prefetch as well as ARC/L2ARC is all in place already. So if no other IOs occur, the disks may spin down... if only not for those "nasty" writes that may sporadically occur and which I'd love to see pushed out to dedicated ZILs ;)

Thanks,
//Jim
On Jan 8, 2012, at 5:10 PM, Jim Klimov wrote:

> Thanks for the pointers. While I've seen those articles
> (in fact, one of the two non-spam comments in Ben's
> blog was mine), rehashing the basics is always useful ;)
>
> Still, how does VDEV prefetch play along with file-level
> prefetch?

Trick question? It doesn't. vdev prefetching is disabled in OpenSolaris b148, illumos, and Solaris 11 releases. The benefits of having the vdev cache for large numbers of disks do not appear to justify the cost. See
http://wesunsolve.net/bugid/id/6684116
https://www.illumos.org/issues/175

> For example, if ZFS prefetched 64K from disk
> at the SPA level, and those sectors luckily happen to
> contain the "next" blocks of a streaming-read file, would
> the file-level prefetch take the data from the RAM cache
> or still request them from the disk?

As of b70, vdev_cache only contains metadata. See
http://wesunsolve.net/bugid/id/6437054

> In what cases would it make sense to increase
> zfs_vdev_cache_size? Does it apply to all disks
> combined, or to each disk (or even slice/partition)
> separately?

It applies to each leaf vdev.

> In fact, this reading got me thinking that I might have
> a fundamental misunderstanding lately; hence a couple
> of new yes-no questions arose:
>
> Is it true or false that: ZFS might skip the cache and
> go to disks for "streaming" reads? (The more I think
> about it, the more senseless this sentence seems, and
> I might have just confused it with ZIL writes of bulk
> data.)

Unless the primarycache parameter is set to none, reads will look in the ARC first.

> Is it true or false that: ARC might evict cached blocks
> based on age (without new reads or other processes
> requiring the RAM space)?

False. Evictions occur when needed. NB, I'm not sure of the status of the Solaris 11 ARC no-grow issue. As that code is not open sourced, and we know that Oracle rewrote some of the ARC code, all bets are off.

> And I guess the generic answer to my original question
> regarding intelligent pre-fetching of whole files is
> that this should be done by scripts outside ZFS itself,
> and that the read-prefetch as well as ARC/L2ARC is all
> in place already. So if no other IOs occur, the disks
> may spin down... if only not for those "nasty" writes
> that may sporadically occur and which I'd love to see
> pushed out to dedicated ZILs ;)

I've set up external prefetching for specific use cases. Spin-down is another can of worms...
 -- richard
On 01/08/12 20:10, Jim Klimov wrote:

> Is it true or false that: ZFS might skip the cache and
> go to disks for "streaming" reads?

I don't believe this was ever suggested. Instead, if data is not already in the file system cache and a large read is made from disk, should the file system put this data into the cache?

BTW, I chose the term streaming to be a subset of sequential, where the access pattern is sequential but at what appear to be artificial time intervals. The suggested pre-read of the entire file would be a simple sequential read done as quickly as the hardware allows.
Thanks for the replies, some more questions follow. Your answers below seem to contradict each other somewhat. Is it true that:

1) the VDEV cache before b70 used to contain a full copy of prefetched disk contents,
2) the VDEV cache since b70 analyzes the prefetched sectors and only keeps metadata blocks,
3) the VDEV cache since b148 is disabled by default?

So in fact currently we only have file-level "intelligent" prefetching?

On my older systems I fired "kstat -p zfs:0:vdev_cache_stats" and saw hit/miss ratios ranging from 30% to 70%. On the oi_148a box I do indeed see all zeros.

While I do understand the implications of VDEV-caching lots of disks on systems with inadequate RAM, I tend to find this feature useful on smaller systems - like home NASes. It is essentially free in terms of mechanical seeks, as well as in RAM (what is 60-100MB for a small box at home?), and any nonzero hit ratio that speeds up the system seems justifiable ;)

I've tried playing with the options on my oi_148a LiveUSB repair boot, and got varying results: the VDEV cache is indeed disabled by default, but can be enabled. My system is scrubbing now, so it got a few cache hits (about 10%) right away.

root at openindiana:~# echo zfs_vdev_cache_size/W0t10000000 | mdb -kw
zfs_vdev_cache_size:    0       =       0x989680

root at openindiana:~# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   65.042318652
zfs:0:vdev_cache_stats:delegations      72
zfs:0:vdev_cache_stats:hits     11
zfs:0:vdev_cache_stats:misses   158
zfs:0:vdev_cache_stats:snaptime 114232.782154249

However, trying to increase the prefetch size hung my system almost immediately (in a couple of seconds). I'm away from it now, so I'll ask for a photo of the console screen :)

root at openindiana:~# echo zfs_vdev_cache_max/W0t16384 | mdb -kw
zfs_vdev_cache_max:     0x4000  =       0x4000
root at openindiana:~# echo zfs_vdev_cache_bshift/W0t20 | mdb -kw
zfs_vdev_cache_bshift:  0x10    =       0x14

So there are deeper questions:

1) As of Illumos bug #175 (as well as OpenSolaris b148 and, if known, Solaris 11), is the vdev prefetch feature *removed* from the codebase ("no" as of oi_148a, what about others?), or disabled by default (i.e. the limit is preset to 0, tune it yourself)?

2) If it is only disabled, are there solid plans to remove it, or can we vote to keep it for those interested? :)

3) If the feature is present and gets enabled, how would VDEV prefetch play along with file prefetch, again? ;)

4) Is there some tunable (after b70) to enable prefetching and keeping of user data as well (not only metadata)? Perhaps only so that I could test it with my use patterns to make sure that caching generic sectors is useless for me, and that I really should revert to caching only metadata?

5) Would it make sense to increase zfs_vdev_cache_bshift? For example, when I tried to set it to 20 and prefetch a whole 1MB of data, why would that cause the system to die? Can it increase cache hit ratios (if it works)?

6) Does the VDEV cache keep ZFS blocks or disk sectors? For example, on my 4k disks the blocks are 4k, even though there are only a few hundred bytes worth of data in metadata blocks and 3+KB of slack space.

7) Modern HDDs often have 32-64MB of DRAM cache onboard. Is there any reason to match the VDEV cache size to that in any way (1:1, 2:1, etc.)?
Thanks again,
//Jim Klimov

2012-01-09 6:06, Richard Elling wrote:

> On Jan 8, 2012, at 5:10 PM, Jim Klimov wrote:
>> 2012-01-09 4:14, Richard Elling wrote:
>>> On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:
>>>
>>>> I wonder if it is possible (currently or in the future as an RFE)
>>>> to tell ZFS to automatically read-ahead some files and cache them
>>>> in RAM and/or L2ARC?
>>>
>>> See discussions on the ZFS intelligent prefetch algorithm. I think Ben Rockwood's
>>> description is the best general description:
>>> http://www.cuddletech.com/blog/pivot/entry.php?id=1040
>>>
>>> And a more engineer-focused description is at:
>>> http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
>>> -- richard
>>
>> Thanks for the pointers. While I've seen those articles
>> (in fact, one of the two non-spam comments in Ben's
>> blog was mine), rehashing the basics is always useful ;)
>>
>> Still, how does VDEV prefetch play along with file-level
>> prefetch?
>
> Trick question? It doesn't. vdev prefetching is disabled in OpenSolaris b148, illumos,
> and Solaris 11 releases. The benefits of having the vdev cache for large numbers of
> disks do not appear to justify the cost. See
> http://wesunsolve.net/bugid/id/6684116
> https://www.illumos.org/issues/175
>
>> For example, if ZFS prefetched 64K from disk
>> at the SPA level, and those sectors luckily happen to
>> contain the "next" blocks of a streaming-read file, would
>> the file-level prefetch take the data from the RAM cache
>> or still request them from the disk?
>
> As of b70, vdev_cache only contains metadata. See
> http://wesunsolve.net/bugid/id/6437054
>
>> In what cases would it make sense to increase
>> zfs_vdev_cache_size? Does it apply to all disks
>> combined, or to each disk (or even slice/partition)
>> separately?
>
> It applies to each leaf vdev.
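For reference, a minimal sketch of how the tunables experimented with above can be toggled, either at runtime via mdb (non-persistent, as in the message above) or persistently via /etc/system; the values are only examples, and whether enabling the vdev cache is wise depends on the workload and available RAM:

    # runtime only (lost on reboot): give the vdev cache ~10 MB per leaf vdev
    echo zfs_vdev_cache_size/W0t10000000 | mdb -kw

    # persistent across reboots: add to /etc/system and reboot
    set zfs:zfs_vdev_cache_size=10000000
    set zfs:zfs_vdev_cache_max=16384

    # check effectiveness afterwards
    kstat -p zfs:0:vdev_cache_stats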
2012-01-09 18:15, John Martin wrote:

> On 01/08/12 20:10, Jim Klimov wrote:
>
>> Is it true or false that: ZFS might skip the cache and
>> go to disks for "streaming" reads?
>> (The more I think
>> about it, the more senseless this sentence seems, and
>> I might have just confused it with ZIL writes of bulk
>> data.)
>
> I don't believe this was ever suggested. Instead, if
> data is not already in the file system cache and a
> large read is made from disk, should the file system
> put this data into the cache?

Hmmm... perhaps THIS is what I mistook it for...

Thus the correct version of the question goes like this: is it true or false that some large reads from disk can be deemed by ZFS as "too big and rare to cache in ARC"? If yes, what conditions are checked to mark a read as such? Can this behavior be disabled in order to try to cache every read (further subject to normal eviction due to MRU/MFU, memory pressure and other considerations)?

Thanks again,
//Jim Klimov
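A related knob, though it controls whether reads populate the caches at all rather than the streaming-read heuristics asked about here, is the per-dataset primarycache/secondarycache property mentioned earlier in the thread. A quick check and an illustrative setting; the dataset name tank/media is only an example:

    # see what the dataset is currently allowed to cache (all | metadata | none)
    zfs get primarycache,secondarycache tank/media

    # cache both data and metadata in the ARC (the default) and in the L2ARC
    zfs set primarycache=all tank/media
    zfs set secondarycache=all tank/media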
On 01/08/12 10:15, John Martin wrote:

> I believe Joerg Moellenkamp published a discussion
> several years ago on how the L1 ARC attempts to deal with the pollution
> of the cache by large streaming reads, but I don't have
> a bookmark handy (nor the knowledge of whether the
> behavior is still accurate).

http://www.c0t0d0s0.org/archives/5329-Some-insight-into-the-read-cache-of-ZFS-or-The-ARC.html
To follow up on the subject of VDEV caching, even if only of metadata: in oi_148a I have found the disabling entry in /etc/system of the LiveUSB:

set zfs:zfs_vdev_cache_size=0

Now that I have the cache turned on and my scrub continues, cache efficiency so far happens to be 75%. Not bad for a feature turned off by default:

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   60.67302806
zfs:0:vdev_cache_stats:delegations      22619
zfs:0:vdev_cache_stats:hits     32989
zfs:0:vdev_cache_stats:misses   10676
zfs:0:vdev_cache_stats:snaptime 39898.161717983

//Jim
2012-01-11 1:26, Jim Klimov wrote:

> To follow up on the subject of VDEV caching, even if only of metadata:
> in oi_148a I have found the disabling entry in /etc/system of the LiveUSB:
>
> set zfs:zfs_vdev_cache_size=0
>
> Now that I have the cache turned on and my scrub continues, cache
> efficiency so far happens to be 75%. Not bad for a feature turned off
> by default:
>
> # kstat -p zfs:0:vdev_cache_stats
> zfs:0:vdev_cache_stats:class    misc
> zfs:0:vdev_cache_stats:crtime   60.67302806
> zfs:0:vdev_cache_stats:delegations      22619
> zfs:0:vdev_cache_stats:hits     32989
> zfs:0:vdev_cache_stats:misses   10676
> zfs:0:vdev_cache_stats:snaptime 39898.161717983
>
> //Jim

And at this moment I would say the caching effect has become incredible (at least for a feature disabled and dismissed as useless/harmful) - if I read the numbers correctly, a 99+% cache hit ratio with just VDEV pre-reads:

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   60.67302806
zfs:0:vdev_cache_stats:delegations      23398
zfs:0:vdev_cache_stats:hits     1309308
zfs:0:vdev_cache_stats:misses   11592
zfs:0:vdev_cache_stats:snaptime 89207.679698161

True, the task (scrubbing) is metadata-intensive :) Still, for the future, when beginning a scrub the system might auto-tune or at least suggest enabling the VDEV prefetch (perhaps with larger strokes)...

BTW, what does the "delegations" field mean? ;)

//Jim Klimov
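For anyone repeating this experiment, the hit ratio quoted above can be computed directly from the same kstat counters; a small sketch using only kstat and awk (nothing assumed beyond the counters shown above):

    # print the vdev_cache hit ratio as a percentage
    kstat -p zfs:0:vdev_cache_stats |
    awk -F'\t' '/:hits/ {h=$2} /:misses/ {m=$2} \
        END { if (h+m > 0) printf "vdev_cache hit ratio: %.1f%%\n", 100*h/(h+m) }'

With the numbers above, 1309308 / (1309308 + 11592) comes out to roughly 99.1%.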