Shane Cox
2010-Feb-25 20:31 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
I'm new to ZFS and looking for some assistance with a performance problem: At the interval of zfs_txg_timeout (I'm using the default of 30), I observe 100-200ms pauses in my application. Based on my application log files, it appears that the write() system call is blocked.

Digging deeper into the problem with DTrace, I discovered that most of the time/pause is spent in the function zil_itx_assign. The output below illustrates how several write system calls are paused in the zil_itx_assign function for approximately 180ms ... then all of them resume simultaneously. Of interest, this behavior only occurs at the very end of spa_sync.

Timestamp       CPU  Function
82682408087424   10  write:entry
82682408099903   10  zil_itx_assign:entry
82682408102896   10  zil_itx_assign:return
82682408106429   10  write:return    Elapsed Time in micros: 19
82682408174003   10  write:entry
82682408184519   10  zil_itx_assign:entry    ## Pause begins here when entering zil_itx_assign
82682408205410    6  write:entry
82682408213613   11  spa_sync:return
82682408215819    6  zil_itx_assign:entry
82682408237359    2  write:entry
82682408252117    2  zil_itx_assign:entry
                                             ## Pause ends here
82682588006927    6  zil_itx_assign:return
82682588006967   10  zil_itx_assign:return
82682588007659    2  zil_itx_assign:return
82682588020317    6  write:return    Elapsed Time in micros: 179814
82682588020321   10  write:return    Elapsed Time in micros: 179846
82682588020326    2  write:return    Elapsed Time in micros: 179782

Here is the output of lockstat -cwPW that further implicates zil_itx_assign:

Adaptive mutex block: 222 events in 5.014 seconds (44 events/sec)

Count indv cuml rcnt      nsec Hottest Lock       Hottest Caller
-------------------------------------------------------------------------------
    1  99%  99% 0.00 185876716 0xffffffff82a7e4c0 zil_itx_assign+0x22
   86   0% 100% 0.00      5424 0xffffffffa17891f8 cv_timedwait_sig_hires+0x2d4
   53   0% 100% 0.00      2328 0xffffffff9f7ef3a0 udp_rcv_enqueue+0xa0
    8   0% 100% 0.00     11664 0xffffffffa0016f18 vdev_queue_io_done+0x3a
   15   0% 100% 0.00      5350 0xffffffff85962b40 clock_tick_process+0xc4

Looking at the source for zil_itx_assign, it looks like I'm contending for a mutex on the zilog, but I'm not familiar enough with ZFS to know why. Any help/suggestions would be appreciated.

Thanks
Bob Friesenhahn
2010-Feb-26 01:47 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
On Thu, 25 Feb 2010, Shane Cox wrote:

> I'm new to ZFS and looking for some assistance with a performance problem:
>
> At the interval of zfs_txg_timeout (I'm using the default of 30), I observe 100-200ms
> pauses in my application. Based on my application log files, it appears that the
> write() system call is blocked. Digging deeper into the problem with DTrace, I

If you check the forum archives you will find quite a few long discussion threads about this issue. I initiated at least one of them.

When zfs writes a transaction group there will be a stall in an application which writes continuously. The main thing you can do is to adjust zfs tunables to limit the size of a transaction group, or to increase the frequency of transaction group commits. One such tunable is

zfs:zfs_write_limit_override

set in /etc/system.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
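[For reference: a write-limit override of this kind goes in /etc/system as shown below. The 256 MB value is only an illustrative starting point, not a figure recommended anywhere in this thread; tune it to your workload. A reboot is required for /etc/system changes to take effect.]

```
* Cap the size of a ZFS transaction group (value in bytes).
* 0x10000000 = 256 MB -- illustrative value only.
set zfs:zfs_write_limit_override=0x10000000
```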
Shane Cox
2010-Feb-26 15:22 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Bob,

Thanks for your reply. As you mentioned, adjusting the zfs tunables to reduce the transaction group size yields shorter but more frequent pauses. Unfortunately, this workaround doesn't sufficiently meet our needs (the pauses are still too long).

I've reviewed the forum archives and read a number of threads related to this issue. However, I didn't find a root-cause explanation for these pauses, only discussion of how to ameliorate them. In my particular case, I would like to know why zfs_log_writes are blocked for 180ms on a mutex (seemingly blocked on the intent log itself) when performing zil_itx_assign. Another thread must hold a lock on the intent log, no? Overall, the system appears healthy as other system calls (e.g., reads and writes to network devices) complete successfully while writes to the intent log are blocked ... so the problem seems to be access to the zfs intent log.

Any additional insight would be appreciated.

Thanks
Bob Friesenhahn
2010-Feb-26 18:29 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
On Fri, 26 Feb 2010, Shane Cox wrote:

> I've reviewed the forum archives and read a number of threads related to this issue. However I
> didn't find a root-cause explanation for these pauses, only talk of how to ameliorate them. In my
> particular case, I would like to know why zfs_log_writes are blocked for 180ms on a mutex (seemingly
> blocked on the intent log itself) when performing zil_itx_assign. Another thread must have a lock on
> the intent log, no?
>
> Any additional insight would be appreciated.

As far as I am aware, none of the zfs authors has been willing to address this issue in public. It is not clear (to me) if the fundamental design of zfs transaction groups requires that writes stop briefly until the transaction group has been flushed to disk. I suspect that this is the case. Perhaps zfs will never meet your timing requirements.

Others here have had considerable success by using RAID interface adaptor cards with battery-backed cache memory and configuring those cards to "IT" JBOD mode. By limiting the TXG size to the amount which will fit in battery-backed cache memory, the time to "commit" the TXG is dramatically reduced as long as the continual write rate does not exceed what the backing disks can sustain. Unfortunately, this may increase the total amount of data written to underlying storage.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Zhu Han
2010-Feb-27 04:30 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Hi,

This page may indicate the root cause:
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

ZFS throttles writes so that the rate at which data enters the open txg matches the speed of the disk I/O. If it detects that the modest measure (a one-tick pause per write) cannot keep the txg from growing too large, it falls back to stalling all write requests. That could be the situation you have observed.

However, please note that this may not be correct, since I'm not a developer of ZFS.

As a workaround, you could add more disks to the ZFS pool to get more bandwidth and alleviate the problem. Or you may want to disable write throttling, if you are sure the writes come only in bursts spread over an extended time. Again, I'm not sure whether the latter solution is feasible.

best regards,
hanzhu
Shane Cox
2010-Feb-27 06:20 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Thanks for your reply. I disabled write throttling, but didn't observe any change in behavior. After doing some more research, I have a theory as to the root cause of the pauses that I'm observing.

Near the end of spa_sync, writes are blocked in function zil_itx_assign as illustrated by the following lockstat output:

Adaptive mutex block: 179 events in 5.015 seconds (36 events/sec)

Count indv cuml rcnt      nsec Hottest Lock       Hottest Caller
-------------------------------------------------------------------------------
    3 100% 100% 0.00 178617192 0xffffffff82a7e4c0 zil_itx_assign+0x22

This function is blocked for 178ms while attempting to get a lock on the zfs intent log.

The function holding the lock is zil_itx_clean as illustrated by the following lockstat output:

Adaptive mutex hold: 146357 events in 5.059 seconds (28927 events/sec)

Count indv cuml rcnt      nsec Lock               Caller
    1   0% 100% 0.00 178438696 0xffffffff82a7e4c0 zil_itx_clean+0xd1

Since zil_itx_clean holds a lock on the zfs intent log for 178ms, no new writes can be performed during this time.

Looking into the source, it appears that zil_itx_clean obtains the lock on the zfs intent log, then enters a while loop, moving the already-sync'd transactions into another list so that they can be freed. Here's a comment from the code within the synchronized block:

 * Move the sync'd log transactions to a separate list so we can call
 * kmem_free without holding the zl_lock.

So it appears that sync'ing the transactions to disk isn't causing the delays. Instead, the cleanup after the sync is the problem. This cleanup holds a lock on the zfs intent log while old/sync'd transactions are moved out of the intent log, during which time new zfs writes are prohibited/blocked.

At least, that's my theory.
Zhu Han
2010-Feb-27 09:09 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Hi,

That's probably true. zil_itx_clean may have to move a lot of entries to another list before it can free them.

It looks like your case is related to the bugs below, but I think they should have been fixed. Can you check your version of Solaris/OpenSolaris to see whether these patches have been included?

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6529212
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6452589

best regards,
hanzhu