Hi All,

I'd like to ask whether there is a method to enforce a certain txg commit
frequency on ZFS. I'm doing a large amount of video streaming from a storage
pool while also slowly and continuously writing a constant volume of data to
it (using a normal file descriptor, *not* in O_SYNC). When the read volume
goes over a certain threshold (and average pool load goes over ~50%), ZFS
thinks it's running out of steam on the storage pool and starts committing
transactions more often, which results in even greater load on the pool.
This leads to a sudden spike in I/O utilization on the pool, roughly as
follows:

  # streaming clients    pool load [%]
   15                      8%
   20                     11%
   40                     22%
   60                     33%
   80                     44%
  --- around here txg timeouts start to shorten ---
   85                     60%
   90                     70%
   95                     85%

My application does a fair bit of caching and prefetching, so I have zfetch
disabled and primarycache set to metadata only. Also, reads happen (on a
per-client basis) relatively infrequently, so I can easily take it if the
pool stops reading for a few seconds and just writes data. The problem is,
ZFS starts alternating between reads and writes really quickly, which in
turn starves me of IOPS and results in a huge load spike. Judging from the
load numbers up to around 80 concurrent clients, I suspect I could go up to
150 concurrent clients on this pool, but because of this spike I top out at
around 95-100 concurrent clients.

Regards,
--
Saso
On 06/24/2011 02:29 PM, Sašo Kiselkov wrote:
> I'd like to ask whether there is a method to enforce a certain txg
> commit frequency on ZFS. [...] ZFS starts alternating between reads and
> writes really quickly, which in turn starves me of IOPS and results in a
> huge load spike.

Can anybody please comment on this? Is it even possible?

Regards,
--
Saso
On Jun 24, 2011, at 5:29 AM, Sašo Kiselkov wrote:
> When the read volume goes over a certain threshold (and average pool load
> goes over ~50%), ZFS thinks it's running out of steam on the storage pool
> and starts committing transactions more often, which results in even
> greater load on the pool. [...]
>
>   # streaming clients    pool load [%]
>    80                     44%
>   --- around here txg timeouts start to shorten ---
>    85                     60%
>    90                     70%
>    95                     85%

What is a "pool load"? We expect 100% utilization during the txg commit;
anything else is a waste.

I suspect that you actually want more, smaller commits to spread the load
more evenly. This is easy to change, but unless you can tell us what OS you
are running, including version, we don't have a foundation to build upon.
 -- richard
On 06/26/2011 06:17 PM, Richard Elling wrote:
> What is a "pool load"? We expect 100% utilization during the txg commit;
> anything else is a waste.
>
> I suspect that you actually want more, smaller commits to spread the load
> more evenly. This is easy to change, but unless you can tell us what OS
> you are running, including version, we don't have a foundation to build
> upon.

Pool load is a 60-second average of the aggregated util percentages as
reported by "iostat -D" for the disks which comprise the pool (so I run
"iostat -Dn {pool-disks} 60" and compute the load for each row printed as an
average of the "util" columns).

Interestingly enough, when watching 1-second updates in iostat I never see
"util" hit 100% during a txg commit, even when the commit takes two or more
seconds to complete. This tells me that the disks still have enough
performance headroom, so ZFS doesn't really need to shorten the interval at
which commits occur.

I'm running oi_148, and all pools are zfs version 28.

Regards,
--
Saso
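(For reference, a minimal sketch of that measurement, assuming "iostat -xn"
per-device rows where the %b busy percentage is the next-to-last field; the
cNtXdY device-name pattern is an assumption and should be narrowed to the
pool's actual disks in practice:)

  # average the %b column over the disks for one 60 s sample; the first
  # sample iostat prints is a since-boot average, so counters reset at
  # every sample header and only the last sample is reported
  iostat -xn 60 2 | nawk '
      /extended device statistics/ { sum = n = 0; next }   # new sample begins
      $NF ~ /^c[0-9]/ { sum += $(NF-1); n++ }              # %b is next-to-last
      END { if (n) printf("avg pool load: %.1f%%\n", sum / n) }'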
> I'd like to ask whether there is a method to enforce a certain txg
> commit frequency on ZFS.

Well, there is a timer frequency based on TXG age (i.e. 5 sec by default
now), set in /etc/system like this:

  set zfs:zfs_txg_synctime = 5

Also there is a buffer-size limit, like this (384 MB):

  set zfs:zfs_write_limit_override = 0x18000000

or on the command line like this:

  # echo zfs_write_limit_override/W0t402653184 | mdb -kw

We had similar spikes with big writes to a Thumper with SXCE in the pre-90
builds, when the system would stall for seconds while flushing a 30-second
TXG full of data. Adding a reasonable megabyte limit solved the
unresponsiveness problem for us, by making these flush-writes rather small
and quick.

See also:
http://opensolaris.org/jive/thread.jspa?threadID=106453&start=15&tstart=0
http://opensolaris.org/jive/thread.jspa?messageID=347212
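(As a side note on the two spellings above: 0x18000000 and 0t402653184
encode the same 384 MB limit. A quick, purely illustrative sanity check:)

  # 384 MB expressed in bytes, for mdb's decimal 0t prefix
  printf '%d\n' $((384 * 1024 * 1024))      # prints 402653184
  printf '0x%x\n' 402653184                 # prints 0x18000000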
On 06/27/2011 11:59 AM, Jim Klimov wrote:
> Well, there is a timer frequency based on TXG age (i.e. 5 sec by default
> now), set in /etc/system like this:
>
>   set zfs:zfs_txg_synctime = 5

When trying to read the value through mdb I get:

  # echo zfs_txg_synctime::print | mdb -k
  mdb: failed to dereference symbol: unknown symbol name

Is this some new addition in S11E?

> Also there is a buffer-size limit, like this (384 MB):
>   set zfs:zfs_write_limit_override = 0x18000000
>
> or on the command line like this:
>   # echo zfs_write_limit_override/W0t402653184 | mdb -kw

Currently my value for this is 0. How should I set it? I'm writing ~15 MB/s
and would like txg flushes to occur at most once every 10 seconds. Should I
set it to 150 MB then?

> We had similar spikes with big writes to a Thumper with SXCE in the
> pre-90 builds [...] Adding a reasonable megabyte limit solved the
> unresponsiveness problem for us, by making these flush-writes rather
> small and quick.

I need to do the opposite - I don't need to shorten the interval of writes,
I need to increase it. Can I do that using zfs_write_limit_override?

Thanks.
--
Saso
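(If that arithmetic holds - an assumption of ~15 MB/s sustained writes and a
10-second target - the corresponding /etc/system entry would be a sketch
like:)

  * sizing sketch: 15 MB/s x 10 s = 150 MB = 157286400 bytes = 0x9600000
  set zfs:zfs_write_limit_override = 0x9600000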
2011-06-29 16:33, Sašo Kiselkov wrote:
> When trying to read the value through mdb I get:
>
>   # echo zfs_txg_synctime::print | mdb -k
>   mdb: failed to dereference symbol: unknown symbol name
>
> Is this some new addition in S11E?

No, it has been in OpenSolaris (SXCE in my case) and Solaris 10, like,
forever. My SXCE 117, sol10u6 x86 and sol10u8 SPARC boxes all return 0x5:

  # echo zfs_txg_synctime::print | mdb -k
  0x5

So I am puzzled why you don't have that value... Please also check whether
you have its possible alternate names, i.e.:

  set zfs:zfs_txg_synctime = 5
  set zfs:zfs_txg_timeout = 5
  set zfs:zfs_txg_synctime_ms = 5000

> Currently my value for this is 0. How should I set it? I'm writing
> ~15 MB/s and would like txg flushes to occur at most once every 10
> seconds. Should I set it to 150 MB then?

I guess so, if you know your flows that precisely. AFAIK the default zero
value disables the specific buffer-size limit and instead uses ZFS's
autocalculated size limit (some percentage of RAM/ARC size), or the TXG sync
time. In our case we estimated an acceptable time-of-writing for the
buffered data, one that does not cause substantial service stalls even if
the system is otherwise unresponsive during a cache flush to disk.
Apparently, smaller values can lead to excessive mechanical IOs and
fragmentation, so there is some balance to find and strike...

> I need to do the opposite - I don't need to shorten the interval of
> writes, I need to increase it. Can I do that using
> zfs_write_limit_override?

I am not sure about that - if the TXG sync timer strikes before the buffer
size is overflowed, a flush should occur. Alternatively, if you buffered too
many writes too quickly (before the TXG sync timer chimes in) - a flush
should occur.

//Jim Klimov
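(A small loop to see which of those names a given kernel actually exports -
a sketch; unknown symbols just print mdb's "unknown symbol name" error:)

  # probe each candidate tunable by name
  for sym in zfs_txg_synctime zfs_txg_timeout zfs_txg_synctime_ms; do
      printf '%s: ' "$sym"
      echo "${sym}::print" | mdb -k
  done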
On 06/29/2011 02:33 PM, Sašo Kiselkov wrote:
> I need to do the opposite - I don't need to shorten the interval of
> writes, I need to increase it. Can I do that using
> zfs_write_limit_override?

Just as a followup, I've had a look at the tunables in dsl_pool.c and found
that I could potentially influence the write-pressure calculation by tuning
zfs_txg_synctime_ms - do you think increasing this value from its default
(1000 ms) would help me lower the write scheduling frequency? (I don't mind
if a txg write takes even twice as long; my application buffers are on
average 6 seconds long.)

Regards,
--
Saso
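(If run-time experimentation is acceptable, a sketch of that change via mdb,
by analogy with the zfs_write_limit_override write shown earlier - /W writes
a 32-bit value and 0t marks a decimal constant:)

  # double the target sync time from its 1000 ms default, at run time
  echo zfs_txg_synctime_ms/W0t2000 | mdb -kw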
2011-06-30 11:47, Sašo Kiselkov wrote:
> On 06/30/2011 02:49 AM, Jim Klimov wrote:
>> It might help. In my limited testing on oi_148a, it seems that
>> zfs_txg_synctime_ms and zfs_txg_timeout are linked somehow (i.e.
>> changing one value changed the other accordingly). So in effect they
>> may be two names for the same tunable (one in units of seconds, the
>> other in milliseconds).
>
> Well, to my understanding, zfs_txg_timeout is the timer limit on flushing
> pending txgs to disk - if the timer fires, the current txg is written to
> disk regardless of its size. Otherwise the txg scheduling algorithm
> should take the I/O pressure on the pool into account, estimate the
> remaining write bandwidth, and fire when it estimates that a txg commit
> would overflow zfs_txg_synctime[_ms]. I tried increasing this value to
> 2000 or 3000, but without effect - perhaps I need to set it at pool mount
> time or in /etc/system. Could somebody with more knowledge of these
> internals please chime in?

Somewhere in our discussion the "Reply-to-all" was lost. Back to the list :)

Saso: Did you try setting both the timeout limit and the megabyte limit, and
did you see the system I/O patterns correlate with these values?

My understanding was like yours above, so if things are different in
reality - I'm interested to know too.

PS: I don't think you wrote which OS version you use?

//Jim
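(One way to observe that correlation directly is to timestamp each commit as
it starts - a sketch assuming the fbt provider's spa_sync() entry probe,
where arg1 is the txg number being synced:)

  # print a wall-clock timestamp and txg number per commit to watch
  # the actual cadence while tuning
  dtrace -n 'fbt::spa_sync:entry { printf("%Y txg %d", walltimestamp, arg1); }'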
On 06/30/2011 01:10 PM, Jim Klimov wrote:
> Saso: Did you try setting both the timeout limit and the megabyte limit,
> and did you see the system I/O patterns correlate with these values?
>
> PS: I don't think you wrote which OS version you use?

Thanks for the suggestions, I'll try them out. I'm running oi_148.

Regards,
--
Saso
2011-06-30 15:22, Sašo Kiselkov wrote:
> I tried increasing this value to 2000 or 3000, but without effect -
> perhaps I need to set it at pool mount time or in /etc/system. Could
> somebody with more knowledge of these internals please chime in?

On this part - it was my understanding and experience (from SXCE) that these
values can be set at run time and take effect as soon as they are set (or
maybe within a few TXGs - but visibly in real time).

Also, I've seen an instant result from setting the TXG sync times on oi_148a
under light load (in my thread about trying to account for some 2 MB writes
to my root pool) - throughput could be 2 MB/s or 0.2 MB/s (all in 2 MB
bursts, though) depending on the TXG timeout value currently set.

//Jim
On 06/30/2011 01:33 PM, Jim Klimov wrote:
> On this part - it was my understanding and experience (from SXCE) that
> these values can be set at run time and take effect as soon as they are
> set (or maybe within a few TXGs - but visibly in real time).

Hm, it appears I'll have to do some reboots and more extensive testing. I
tried tuning various settings and then returned everything back to the
defaults. Yet now I can ramp the number of concurrent output streams up to
~170 instead of the original 95 (and even then the pool still has capacity
left; I'm actually running out of CPU power). The txg commit occurs roughly
every 15 (or so) seconds, which is what I wanted. Strange that this happens
even after I returned everything to the defaults... I'll do some more
testing on this once I move the production deployment to a different system
and have more time to experiment with this machine. Anyway, thanks for the
suggestions, it helped a lot.

Regards,
--
Saso
On 06/30/2011 11:56 PM, Sašo Kiselkov wrote:
> I tried tuning various settings and then returned everything back to the
> defaults. Yet now I can ramp the number of concurrent output streams up
> to ~170 instead of the original 95. [...] Strange that this happens even
> after I returned everything to the defaults...

Just a follow-up correction: one parameter was indeed changed:
zfs_write_limit_inflated. In the source it's set to zero; I've set it to
0x200000000.

Regards,
--
Saso
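(For completeness, the persistent /etc/system form of that run-time change
would presumably be the following - 0x200000000 bytes being 8 GiB:)

  * inflated write limit of 8 GiB (0x200000000 bytes)
  set zfs:zfs_write_limit_inflated = 0x200000000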
On 07/01/2011 12:01 AM, Sašo Kiselkov wrote:
> Just a follow-up correction: one parameter was indeed changed:
> zfs_write_limit_inflated. In the source it's set to zero; I've set it to
> 0x200000000.

So it seems I was wrong after all and it didn't help. The question remains:
is there a way to force ZFS *NOT* to commit a txg before a certain minimum
amount of data has accumulated in it, or before the txg timeout is reached?

All the best,
--
Saso