I got a 256GB Crucial M4 to use for L2ARC for my OpenIndiana box. I added it to the tank pool and let it warm for a day or so. By that point, 'zpool iostat -v' said the cache device had about 9GB of data, but (and this is what has me puzzled) kstat showed ZERO l2_hits. That's right, zero.

kstat | egrep "(l2_hits|l2_misses)"
        l2_hits                         0
        l2_misses                       1143249

The box has 20GB of RAM (it's actually a virtual machine on an ESXi host). The datastore for the VMs is about 256GB. My first thought was everything is hitting in ARC, but that is clearly not the case, since it WAS gradually filling up the cache device. Maybe it's possible that every single miss is never ever being re-read, but that seems unlikely, no? If the l2_hits was a small number, I'd think it just wasn't giving me any bang for the buck, but zero sounds suspiciously like some kind of bug/mis-configuration. primarycache and secondarycache are both set to all.

ARC stats via arc_summary.pl:

ARC Efficency:
         Cache Access Total:             12324974
         Cache Hit Ratio:      87%       10826363       [Defined State for buffer]
         Cache Miss Ratio:     12%       1498611        [Undefined State for Buffer]
         REAL Hit Ratio:       68%       8469470        [MRU/MFU Hits Only]

         Data Demand   Efficiency:    85%
         Data Prefetch Efficiency:    59%

For the moment, I gave up and moved the SSD back to being my windows7 drive, where it does make a difference :) I'd be willing to shell out for another SSD, but only if I can gain some benefit from it. Any thoughts would be appreciated.
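(For anyone wanting to check the same counters: they come from the ZFS arcstats kstat, so a slightly more targeted query than grepping all of kstat's output would be the following; zfs:0:arcstats is the usual instance name on illumos/OpenIndiana, adjust if yours differs.)

kstat -p zfs:0:arcstats | egrep 'l2_(hits|misses|size)'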
Dan,

If you're not already familiar with it, I find the following command useful. It shows the realtime total read commands, number hitting/missing ARC, number hitting/missing L2ARC, breakdown of MRU/MFU, etc.

arcstat_v2.pl -f read,hits,mru,mfu,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size,mrug,mfug 1

That version of arcstat is from http://github.com/mharsch/arcstat

For comparison, I've got about 650GB of VMs on each of my two Nexenta VSAs (16GB RAM / 240GB L2ARC). When it's just ticking over at 50-1000 r/s, 99% of that is going to ARC, but I'm also seeing patches where it goes to 2-5k reads and I'm seeing 20-80% L2ARC hits. These have been running for about a week and, given my understanding of how L2ARC fills, I'd suggest maybe leaving it to warm up longer (e.g. 1-2 weeks?).

Caveat: I'm a complete newbie to ZFS, so I could be completely wrong ;)

Cheers,
James
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-11 12:29 UTC
[zfs-discuss] Interesting question about L2ARC
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dan Swartzendruber
>
> My first thought was everything is
> hitting in ARC, but that is clearly not the case, since it WAS gradually
> filling up the cache device.

When things become colder in the ARC, they expire to the L2ARC (or simply expire, bypassing the L2ARC). So it's normal to start filling the L2ARC, even if you never hit anything in the L2ARC.

> ARC Efficency:
>          Cache Access Total:             12324974
>          Cache Hit Ratio:      87%       10826363       [Defined State

That is a REALLY high hit ratio for ARC. It sounds to me like you probably have enough RAM in there that nearly everything is being served from ARC.
Hmmm, but the "real hit ratio" was 68%?

-----Original Message-----
From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) [mailto:opensolarisisdeadlongliveopensolaris at nedharvey.com]
Sent: Tuesday, September 11, 2012 8:30 AM
To: Dan Swartzendruber; zfs-discuss at opensolaris.org
Subject: RE: [zfs-discuss] Interesting question about L2ARC

> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dan Swartzendruber
>
> My first thought was everything is
> hitting in ARC, but that is clearly not the case, since it WAS
> gradually filling up the cache device.

When things become colder in the ARC, they expire to the L2ARC (or simply expire, bypassing the L2ARC). So it's normal to start filling the L2ARC, even if you never hit anything in the L2ARC.

> ARC Efficency:
>          Cache Access Total:             12324974
>          Cache Hit Ratio:      87%       10826363       [Defined State

That is a REALLY high hit ratio for ARC. It sounds to me like you probably have enough RAM in there that nearly everything is being served from ARC.
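(For reference, both ratios come from the same totals in the arc_summary.pl output quoted earlier: 10826363 / 12324974 is roughly 87.8% and counts every ARC hit, while the "REAL" figure 8469470 / 12324974 is roughly 68.7% and counts only hits that arc_summary.pl attributes to the MRU/MFU lists; the gap is the remaining hits it does not classify that way.)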
I think you may have a point. I'm also inclined to enable prefetch caching per Saso's comment, since I don't have massive throughput - latency is more important to me.

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of James H
Sent: Tuesday, September 11, 2012 5:09 AM
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

Dan,

If you're not already familiar with it, I find the following command useful. It shows the realtime total read commands, number hitting/missing ARC, number hitting/missing L2ARC, breakdown of MRU/MFU, etc.

arcstat_v2.pl -f read,hits,mru,mfu,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size,mrug,mfug 1

That version of arcstat is from http://github.com/mharsch/arcstat

For comparison, I've got about 650GB of VMs on each of my two Nexenta VSAs (16GB RAM / 240GB L2ARC). When it's just ticking over at 50-1000 r/s, 99% of that is going to ARC, but I'm also seeing patches where it goes to 2-5k reads and I'm seeing 20-80% L2ARC hits. These have been running for about a week and, given my understanding of how L2ARC fills, I'd suggest maybe leaving it to warm up longer (e.g. 1-2 weeks?).

Caveat: I'm a complete newbie to ZFS, so I could be completely wrong ;)

Cheers,
James
On 09/11/2012 03:32 PM, Dan Swartzendruber wrote:
> I think you may have a point. I'm also inclined to enable prefetch caching
> per Saso's comment, since I don't have massive throughput - latency is more
> important to me.

I meant to say the exact opposite: enable prefetch caching only if your l2arc is faster (in terms of bulk throughput) than your disks. Prefetch isn't latency bound by its very definition, so it generally makes little sense to l2arc cache it.

Cheers,
--
Saso
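(The knob being discussed here appears to be the illumos l2arc_noprefetch tunable, which defaults to 1, meaning prefetched buffers are not written to L2ARC. A minimal sketch of flipping it for testing, assuming an illumos/OpenIndiana kernel and root privileges; values are only examples:)

# allow prefetched buffers to be cached in L2ARC (runtime only)
echo "l2arc_noprefetch/W0t0" | mdb -kw

# restore the default behaviour
echo "l2arc_noprefetch/W0t1" | mdb -kw

# persistent equivalent, in /etc/system (takes effect on reboot)
set zfs:l2arc_noprefetch = 0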
LOL, I actually was unclear, not you. I understood what you were saying, sorry for being unclear. I have 4 disks in raid10, so my max random read throughput is theoretically somewhat faster than the L2ARC device, but I never really do that intensive of reads. My point was: if a guest does read a bunch of data sequentially, that will trigger the prefetch L2ARC code path, correct? If so, I *do* want that cached in L2ARC, so that a return visit from that guest will hit as much as possible in the cache. One other thing (I don't think I mentioned this): my entire ESXi dataset is only like 160GB (thin provisioning in action), so it seems to me I should be able to fit the entire thing in L2ARC?

-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
Sent: Tuesday, September 11, 2012 9:35 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 03:32 PM, Dan Swartzendruber wrote:
> I think you may have a point. I'm also inclined to enable prefetch
> caching per Saso's comment, since I don't have massive throughput -
> latency is more important to me.

I meant to say the exact opposite: enable prefetch caching only if your l2arc is faster (in terms of bulk throughput) than your disks. Prefetch isn't latency bound by its very definition, so it generally makes little sense to l2arc cache it.

Cheers,
--
Saso
On 09/11/2012 03:41 PM, Dan Swartzendruber wrote:
> LOL, I actually was unclear, not you. I understood what you were saying,
> sorry for being unclear. I have 4 disks in raid10, so my max random read
> throughput is theoretically somewhat faster than the L2ARC device, but I
> never really do that intensive of reads.

But here's the kicker: prefetch is never random, it's always linear, so you need to measure prefetch throughput against the near-linear throughput of your disks. Your average 7k2 disk is capable of ~100MB/s in linear reads, so in a pair-of-mirrors scenario (raid10) you effectively get in excess of 400MB/s of prefetch throughput.

> My point was: if a guest does read
> a bunch of data sequentially, that will trigger the prefetch L2ARC code
> path, correct?

No. When a client does a linear read, the initial buffer is random, so it makes sense to serve that from the l2arc - thus it is cached in the l2arc. Subsequently ZFS detects that the client is likely to want more buffers, so it will start prefetching the following blocks in the background. Then when the client returns, they will receive the blocks from ARC. The result is that the client wasn't latency constrained to process this block, so there's no need to cache the subsequently prefetched blocks in l2arc.

> If so, I *do* want that cached in L2ARC, so that a return
> visit from that guest will hit as much as possible in the cache.

It will be in the normal ARC cache; however, l2arc is meant primarily to accelerate the initial block hit (as noted above), not the subsequently prefetched ones (which we have time to refetch from the main pool). This covers most generic filesystem use cases as well as random-read heavy workloads (such as databases, which rarely, if ever, do linear reads).

> One other
> thing (I don't think I mentioned this): my entire ESXi dataset is only like
> 160GB (thin provisioning in action), so it seems to me I should be able to
> fit the entire thing in L2ARC?

Please try to post the output of this after you let it run on your dataset for a few minutes:

$ arcstat.pl -f \
    arcsz,read,dread,pread,hit%,miss%,l2size,l2read,l2hit%,l2miss% 60

It should give us a good idea of the kind of workload we're dealing with and why your L2 hits are so low.

Cheers,
--
Saso
-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
Sent: Tuesday, September 11, 2012 9:52 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 03:41 PM, Dan Swartzendruber wrote:
> LOL, I actually was unclear, not you. I understood what you were
> saying, sorry for being unclear. I have 4 disks in raid10, so my max
> random read throughput is theoretically somewhat faster than the L2ARC
> device, but I never really do that intensive of reads.

But here's the kicker: prefetch is never random, it's always linear, so you need to measure prefetch throughput against the near-linear throughput of your disks. Your average 7k2 disk is capable of ~100MB/s in linear reads, so in a pair-of-mirrors scenario (raid10) you effectively get in excess of 400MB/s of prefetch throughput.

*** True, and badly worded on my part. In theory, the 4 nearline SAS drives could deliver 600MB/sec, but my path to the guests is maybe 3GB (this is all running virtualized, so I can exceed gig/e speed). The bottom line there is that no amount of read effort by guests (via the hypervisor) is going to come anywhere near the pool's capabilities.

> My point was: if a guest does read
> a bunch of data sequentially, that will trigger the prefetch L2ARC
> code path, correct?

No. When a client does a linear read, the initial buffer is random, so it makes sense to serve that from the l2arc - thus it is cached in the l2arc. Subsequently ZFS detects that the client is likely to want more buffers, so it will start prefetching the following blocks in the background. Then when the client returns, they will receive the blocks from ARC. The result is that the client wasn't latency constrained to process this block, so there's no need to cache the subsequently prefetched blocks in l2arc.

*** Sorry, that's what I meant by "the prefetch l2arc code path", i.e. the heuristics you referred to. It seems to me that if the client never wants the prefetched block, it was a waste to cache it; if he does, at worst he'll miss once, and then it will be cached, since it will have been a demand read?

> If so, I *do* want that cached in L2ARC, so that a return visit from
> that guest will hit as much as possible in the cache.

It will be in the normal ARC cache; however, l2arc is meant primarily to accelerate the initial block hit (as noted above), not the subsequently prefetched ones (which we have time to refetch from the main pool). This covers most generic filesystem use cases as well as random-read heavy workloads (such as databases, which rarely, if ever, do linear reads).

> One other
> thing (I don't think I mentioned this): my entire ESXi dataset is only
> like 160GB (thin provisioning in action), so it seems to me I should
> be able to fit the entire thing in L2ARC?

Please try to post the output of this after you let it run on your dataset for a few minutes:

$ arcstat.pl -f \
    arcsz,read,dread,pread,hit%,miss%,l2size,l2read,l2hit%,l2miss% 60

It should give us a good idea of the kind of workload we're dealing with and why your L2 hits are so low.

*** Thanks a lot for clarifying how this works. Since I'm quite happy having an SSD in my workstation, I will need to purchase another SSD :) I'm wondering if it makes more sense to buy two SSDs of half the size (e.g. 128GB), since the total price is about the same?
On 09/11/2012 04:06 PM, Dan Swartzendruber wrote:
> Thanks a lot for clarifying how this works.

You're very welcome.

> Since I'm quite happy
> having an SSD in my workstation, I will need to purchase another SSD :)
> I'm wondering if it makes more sense to buy two SSDs of half the size
> (e.g. 128GB), since the total price is about the same?

If you have the space/ports and it costs the same, two SSDs will definitely give you better iops and throughput than a single SSD twice the size.

Cheers,
--
Saso
-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
Sent: Tuesday, September 11, 2012 10:12 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 04:06 PM, Dan Swartzendruber wrote:
> Thanks a lot for clarifying how this works.

You're very welcome.

> Since I'm quite happy
> having an SSD in my workstation, I will need to purchase another SSD :)
> I'm wondering if it makes more sense to buy two SSDs of half the size
> (e.g. 128GB), since the total price is about the same?

If you have the space/ports and it costs the same, two SSDs will definitely give you better iops and throughput than a single SSD twice the size.

*** I have plenty of ports: 8 ports on an LSI HBA, one of which goes to the jbod expander/chassis, so connecting two SSDs is no issue. Thanks again...
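(For anyone following along, attaching a pair of cache devices is a single command, and the L2ARC feed thread rotates writes across them, so both get used; the pool and device names below are placeholders, not the actual ones from this setup:)

# zpool add tank cache c2t0d0 c2t1d0
# zpool iostat -v tank 5     # watch both cache devices fill and serve reads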
Interesting. They are in and running and I'm hitting on them pretty good. One thing I did change: I've seen recommendations (one in particular from Nexenta) for an 8KB recordsize for virtualization workloads (which is 100% of my workload). I svMotioned all the VMs off the datastore and back (to get the new, smaller recordsize). I wonder if that has an effect?

-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
Sent: Tuesday, September 11, 2012 10:12 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 04:06 PM, Dan Swartzendruber wrote:
> Thanks a lot for clarifying how this works.

You're very welcome.

> Since I'm quite happy
> having an SSD in my workstation, I will need to purchase another SSD :)
> I'm wondering if it makes more sense to buy two SSDs of half the size
> (e.g. 128GB), since the total price is about the same?

If you have the space/ports and it costs the same, two SSDs will definitely give you better iops and throughput than a single SSD twice the size.

Cheers,
--
Saso
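(Worth noting for anyone trying the same thing: recordsize only affects blocks written after the property is changed, which is why moving the VMs off and back was needed to rewrite everything at 8K. A minimal sketch, with a made-up dataset name standing in for the real one:)

# zfs set recordsize=8K tank/vmds
# zfs get recordsize tank/vmds    # existing files keep their old block size until rewritten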
2012-09-11 16:29, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Dan Swartzendruber
>>
>> My first thought was everything is
>> hitting in ARC, but that is clearly not the case, since it WAS gradually
>> filling up the cache device.
>
> When things become colder in the ARC, they expire to the L2ARC (or simply
> expire, bypassing the L2ARC). So it's normal to start filling the L2ARC,
> even if you never hit anything in the L2ARC.

Got me wondering: how many reads of a block from spinning rust suffice for it to ultimately get into L2ARC? Just one, so it gets into a recent-read list of the ARC and then expires into L2ARC when ARC RAM is more needed for something else, and only when that L2ARC fills up does the block expire from these caches completely?

Thanks, and sorry for a lame question ;)
//Jim
On 9/25/2012 3:38 PM, Jim Klimov wrote:
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one, so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else, and only
> when that L2ARC fills up does the block expire from these caches
> completely?

Good question. I don't remember if I posted my final status, but I put in 2 128GB SSDs and it's hitting them just fine. The working set seems to be right on 110GB.
On 09/25/2012 09:38 PM, Jim Klimov wrote:
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one, so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else, and only
> when that L2ARC fills up does the block expire from these caches
> completely?
>
> Thanks, and sorry for a lame question ;)

Correct. See
https://github.com/illumos/illumos-gate/blob/14d44f2248cc2a54490db7f7caa4da5968f90837/usr/src/uts/common/fs/zfs/arc.c#L3685
for an exact description of the ARC <-> L2ARC interaction mechanism.

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-26 11:14 UTC
[zfs-discuss] Interesting question about L2ARC
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one, so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else,

Correct, but not always sufficient. I forget the name of the parameter, but there's some rate-limiting thing that limits how fast you can fill the L2ARC. This means sometimes things will expire from ARC and simply get discarded.
On 09/26/2012 01:14 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> Got me wondering: how many reads of a block from spinning rust
>> suffice for it to ultimately get into L2ARC? Just one, so it
>> gets into a recent-read list of the ARC and then expires into
>> L2ARC when ARC RAM is more needed for something else,
>
> Correct, but not always sufficient. I forget the name of the parameter,
> but there's some rate-limiting thing that limits how fast you can fill
> the L2ARC. This means sometimes things will expire from ARC and simply
> get discarded.

The parameters are:

*) l2arc_write_max (default 8MB): max number of bytes written per fill cycle
*) l2arc_headroom (default 2x): multiplies the above parameter and determines how far into the ARC lists we will search for buffers eligible for writing to L2ARC
*) l2arc_feed_secs (default 1s): regular interval between fill cycles
*) l2arc_feed_min_ms (default 200ms): minimum interval between fill cycles

Cheers,
--
Saso
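(As an illustration of how these get adjusted in practice on an illumos/OpenIndiana box; the values are only examples, not recommendations. l2arc_write_max is a 64-bit tunable, so mdb's 8-byte /Z write is the one to use at runtime, and /etc/system makes the change persistent:)

# raise the per-cycle write limit to 64MB at runtime
echo "l2arc_write_max/Z0t67108864" | mdb -kw

# persistent equivalent, in /etc/system (takes effect on reboot)
set zfs:l2arc_write_max = 67108864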
On Sep 26, 2012, at 4:28 AM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:

> The parameters are:
>
> *) l2arc_write_max (default 8MB): max number of bytes written per fill cycle

It should be noted that this level was perhaps appropriate 6 years ago, when L2ARC was integrated, given the SSDs available at the time, but it is well below reasonable settings for high-speed systems or modern SSDs. It is probably not a bad idea to change the default to reflect more modern systems, thus avoiding surprises.
 -- richard

> *) l2arc_headroom (default 2x): multiplies the above parameter and
> determines how far into the ARC lists we will search for buffers
> eligible for writing to L2ARC
> *) l2arc_feed_secs (default 1s): regular interval between fill cycles
> *) l2arc_feed_min_ms (default 200ms): minimum interval between fill cycles
>
> Cheers,
> --
> Saso

--
illumos Day & ZFS Day, Oct 1-2, 2012, San Francisco - www.zfsday.com
Richard.Elling at RichardElling.com
+1-760-896-4422