Shane Cox
2010-Feb-25 20:31 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
I'm new to ZFS and looking for some assistance with a performance problem: At the interval of zfs_txg_timeout (I'm using the default of 30), I observe 100-200ms pauses in my application. Based on my application log files, it appears that the write() system call is blocked.

Digging deeper into the problem with DTrace, I discovered that most of the time/pause is spent in the function zil_itx_assign. The output below illustrates how several write system calls are paused in the zil_itx_assign function for approximately 180ms ... then all of them resume simultaneously. Of interest, this behavior only occurs at the very end of spa_sync.

Timestamp       CPU  Function
82682408087424   10  write:entry
82682408099903   10  zil_itx_assign:entry
82682408102896   10  zil_itx_assign:return
82682408106429   10  write:return    Elapsed Time in micros: 19
82682408174003   10  write:entry
82682408184519   10  zil_itx_assign:entry    ## Pause begins here when entering zil_itx_assign
82682408205410    6  write:entry
82682408213613   11  spa_sync:return
82682408215819    6  zil_itx_assign:entry
82682408237359    2  write:entry
82682408252117    2  zil_itx_assign:entry
                                             ## Pause ends here
82682588006927    6  zil_itx_assign:return
82682588006967   10  zil_itx_assign:return
82682588007659    2  zil_itx_assign:return
82682588020317    6  write:return    Elapsed Time in micros: 179814
82682588020321   10  write:return    Elapsed Time in micros: 179846
82682588020326    2  write:return    Elapsed Time in micros: 179782

Here is the output of lockstat -cwPW that further implicates zil_itx_assign:

Adaptive mutex block: 222 events in 5.014 seconds (44 events/sec)

Count indv cuml rcnt      nsec Hottest Lock       Hottest Caller
-------------------------------------------------------------------------------
    1  99%  99% 0.00 185876716 0xffffffff82a7e4c0 zil_itx_assign+0x22
   86   0% 100% 0.00      5424 0xffffffffa17891f8 cv_timedwait_sig_hires+0x2d4
   53   0% 100% 0.00      2328 0xffffffff9f7ef3a0 udp_rcv_enqueue+0xa0
    8   0% 100% 0.00     11664 0xffffffffa0016f18 vdev_queue_io_done+0x3a
   15   0% 100% 0.00      5350 0xffffffff85962b40 clock_tick_process+0xc4

Looking at the source for zil_itx_assign, it looks like I'm contending for a mutex on the zilog, but I'm not familiar enough with ZFS to know why. Any help/suggestions would be appreciated.

Thanks
Bob Friesenhahn
2010-Feb-26 01:47 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
On Thu, 25 Feb 2010, Shane Cox wrote:

> I'm new to ZFS and looking for some assistance with a performance problem:
>
> At the interval of zfs_txg_timeout (I'm using the default of 30), I observe 100-200ms
> pauses in my application. Based on my application log files, it appears that the
> write() system call is blocked. Digging deeper into the problem with DTrace, I

If you check the forum archives you will find quite a few long discussion threads about this issue. I initiated at least one of them.

When zfs writes a transaction group there will be a stall in an application which writes continuously. The main thing you can do is to adjust zfs tunables to limit the size of a transaction group, or to increase the frequency of transaction group commits. One such tunable is

zfs:zfs_write_limit_override

set in /etc/system.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
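[For reference: a write-limit override of this kind goes in /etc/system as shown below. The 256 MB value is only an illustrative starting point, not a figure recommended anywhere in this thread; tune it to your workload. A reboot is required for /etc/system changes to take effect.]

```
* Cap the size of a ZFS transaction group (value in bytes).
* 0x10000000 = 256 MB -- illustrative value only.
set zfs:zfs_write_limit_override=0x10000000
```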
Shane Cox
2010-Feb-26 15:22 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Bob,

Thanks for your reply. As you mentioned, adjusting the zfs tunables to reduce the transaction group size yields shorter but more frequent pauses. Unfortunately, this workaround doesn't sufficiently meet our needs (the pauses are still too long).

I've reviewed the forum archives and read a number of threads related to this issue. However, I didn't find a root-cause explanation for these pauses, only discussion of how to ameliorate them. In my particular case, I would like to know why zfs_log_writes are blocked for 180ms on a mutex (seemingly blocked on the intent log itself) when performing zil_itx_assign. Another thread must hold a lock on the intent log, no? Overall, the system appears healthy as other system calls (e.g., reads and writes to network devices) complete successfully while writes to the intent log are blocked ... so the problem seems to be access to the zfs intent log.

Any additional insight would be appreciated.

Thanks
Bob Friesenhahn
2010-Feb-26 18:29 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
On Fri, 26 Feb 2010, Shane Cox wrote:

> I've reviewed the forum archives and read a number of threads related to this issue. However I
> didn't find a root-cause explanation for these pauses, only talk of how to ameliorate them. In my
> particular case, I would like to know why zfs_log_writes are blocked for 180ms on a mutex (seemingly
> blocked on the intent log itself) when performing zil_itx_assign. Another thread must have a lock on
> the intent log, no?
>
> Any additional insight would be appreciated.

As far as I am aware, none of the zfs authors has been willing to address this issue in public. It is not clear (to me) if the fundamental design of zfs transaction groups requires that writes stop briefly until the transaction group has been flushed to disk. I suspect that this is the case. Perhaps zfs will never meet your timing requirements.

Others here have had considerable success by using RAID interface adaptor cards with battery-backed cache memory and configuring those cards to "IT" JBOD mode. By limiting the TXG size to the amount which will fit in battery-backed cache memory, the time to "commit" the TXG is dramatically reduced as long as the continual write rate does not exceed what the backing disks can sustain. Unfortunately, this may increase the total amount of data written to underlying storage.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Zhu Han
2010-Feb-27 04:30 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Hi,

This page may indicate the root cause:
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

ZFS throttles writes so that the rate at which data enters the open txg matches the speed of the disk I/O. If it detects that the modest measure (a one-tick pause per write) cannot keep the txg from growing too large, it falls back to stalling all write requests. That could be the situation you have observed.

However, please note that this may not be correct, since I'm not a developer of ZFS.

As a workaround, you could add more disks to the ZFS pool to get more bandwidth and alleviate the problem. Or you may want to disable write throttling, if you are sure the writes come only in bursts spread over an extended time. Again, I'm not sure whether the latter solution is feasible.

best regards,
hanzhu
Shane Cox
2010-Feb-27 06:20 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Thanks for your reply. I disabled write throttling, but didn't observe any change in behavior. After doing some more research, I have a theory as to the root cause of the pauses that I'm observing.

Near the end of spa_sync, writes are blocked in function zil_itx_assign as illustrated by the following lockstat output:

Adaptive mutex block: 179 events in 5.015 seconds (36 events/sec)

Count indv cuml rcnt      nsec Hottest Lock       Hottest Caller
-------------------------------------------------------------------------------
    3 100% 100% 0.00 178617192 0xffffffff82a7e4c0 zil_itx_assign+0x22

This function is blocked for 178ms while attempting to get a lock on the zfs intent log.

The function holding the lock is zil_itx_clean as illustrated by the following lockstat output:

Adaptive mutex hold: 146357 events in 5.059 seconds (28927 events/sec)

Count indv cuml rcnt      nsec Lock               Caller
    1   0% 100% 0.00 178438696 0xffffffff82a7e4c0 zil_itx_clean+0xd1

Since zil_itx_clean holds a lock on the zfs intent log for 178ms, no new writes can be performed during this time.

Looking into the source, it appears that zil_itx_clean obtains the lock on the zfs intent log, then enters a while loop, moving the already-sync'd transactions into another list so that they can be freed. Here's a comment from the code within the synchronized block:

 * Move the sync'd log transactions to a separate list so we can call
 * kmem_free without holding the zl_lock.

So it appears that sync'ing the transactions to disk isn't causing the delays. Instead, the cleanup after the sync is the problem. This cleanup holds a lock on the zfs intent log while old/sync'd transactions are moved out of the intent log, during which time new zfs writes are prohibited/blocked.

At least, that's my theory.
Zhu Han
2010-Feb-27 09:09 UTC
[zfs-discuss] application writes are blocked near the end of spa_sync
Hi,

That's probably true. zil_itx_clean may have to move a lot of entries to another list before it can free them.

It looks like your case is related to the bugs below, but I think they should have been fixed. Can you check your version of Solaris/OpenSolaris to see whether these patches have been included?

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6529212
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6452589

best regards,
hanzhu