An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-code/attachments/20080722/e54f3bd7/attachment.html>
ZFS is designed to "sync" a transaction group about every 5 seconds under normal work loads. So your system looks to be operating as designed. Is there some specific reason why you need to reduce this interval? In general, this is a bad idea, as there is somewhat of a "fixed overhead" associated with each sync, so increasing the sync frequency could result in increased IO. -Mark Tharindu Rukshan Bamunuarachchi wrote:> Dear ZFS Gurus, > > We are developing low latency transaction processing systems for stock > exchanges. > Low latency high performance file system is critical component of our > trading systems. > > We have choose ZFS as our primary file system. > But we saw periodical disk write peaks every 4-5 second. > > Please refer first column of below output. (marked in bold) > Output is generated from our own Disk performance measuring tool. i.e > DTool (please find attachment) > > Compared UFS/VxFS , ZFS is performing very well, but we could not > minimize periodical peaks. > We used autoup and tune_r_fsflush flags for UFS tuning. > > Are there any ZFS specific tuning, which will reduce file system flush > interval of ZFS. > > I have tried all parameters specified in "solarisinternals" and google.com. > I would like to go for ZFS code change/recompile if necessary. > > Please advice. > > Cheers > Tharindu > > > > cpu4600-100 /tantan >./*DTool -f M -s 1000 -r 10000 -i 1 -W* > System Tick = 100 usecs > Clock resolution 10 > HR Timer created for 100usecs > z_FileName = M > i_Rate = 10000 > l_BlockSize = 1000 > i_SyncInterval = 0 > l_TickInterval = 100 > i_TicksPerIO = 1 > i_NumOfIOsPerSlot = 1 > Max (us)| Min (us) | Avg (us) | MB/S | File > Freq Distribution > 336 | 4 | 10.5635 | 4.7688 | M > 50(98.55), 200(1.09), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > *1911 * | 4 | 10.3152 | 9.4822 | M > 50(98.90), 200(0.77), 500(0.32), 2000(0.01), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 307 | 4 | 9.9386 | 9.5324 | M > 50(99.03), 200(0.66), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 331 | 4 | 9.9465 | 9.5332 | M > 50(99.04), 200(0.72), 500(0.24), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 318 | 4 | 10.1241 | 9.5309 | M > 50(99.07), 200(0.66), 500(0.27), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 303 | 4 | 9.9236 | 9.5296 | M > 50(99.13), 200(0.59), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 560 | 4 | 10.2604 | 9.4565 | M > 50(98.82), 200(0.86), 500(0.31), 2000(0.01), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 376 | 4 | 9.9975 | 9.5176 | M > 50(99.05), 200(0.63), 500(0.32), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > *9783 * | 4 | 10.8216 | 9.5301 | M > 50(99.05), 200(0.58), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.01), > 100000(0.00), 200000(0.00), > 332 | 4 | 9.9345 | 9.5252 | M > 50(99.06), 200(0.61), 500(0.33), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 355 | 4 | 9.9906 | 9.5315 | M > 50(99.01), 200(0.69), 500(0.30), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 356 | 4 | 10.2341 | 9.5207 | M > 50(98.96), 200(0.76), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > 320 | 4 | 9.8893 | 9.5279 | M > 50(99.10), 200(0.59), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00), > *10005* | 4 | 10.8956 | 9.5258 | M > 50(99.07), 200(0.63), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), 
> 100000(0.01), 200000(0.00), > 308 | 4 | 9.8417 | 9.5312 | M > 50(99.07), 200(0.64), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), > 100000(0.00), 200000(0.00),
Tharindu Rukshan Bamunuarachchi
2008-Jul-23 05:35 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
Dear Mark/All,

Our trading system writes to a local and/or array volume at 10k messages per second. Each message is about 700 bytes in size.

Before ZFS, we used UFS. Even with UFS, there was a peak every 5 seconds due to fsflush invocation. Each peak is about ~5 ms, and our application cannot recover from such high latency. So we used several tuning parameters (tune_r_* and autoup) to decrease the flush interval. As a result the peaks came down to ~1.5 ms, but that is still too high for our application.

I believe that if we could reduce the ZFS sync interval to ~1 s, the peaks would come down to ~1 ms or less. We would rather have <1 ms peaks every second than a 5 ms peak every 5 seconds :-)

Is there any tunable I can use to reduce the ZFS sync interval? If there is no tunable, can I use "mdb" for the job? This is not a general-purpose deployment, and we are OK with the increased I/O rate.

Please advise/help. Thanks in advance.

tharindu

Mark Maybee wrote:> ZFS is designed to "sync" a transaction group about every 5 seconds > under normal work loads. So your system looks to be operating as > designed. Is there some specific reason why you need to reduce this > interval? In general, this is a bad idea, as there is somewhat of a > "fixed overhead" associated with each sync, so increasing the sync > frequency could result in increased IO. > > -Mark > > Tharindu Rukshan Bamunuarachchi wrote: >> Dear ZFS Gurus, >> >> We are developing low latency transaction processing systems for >> stock exchanges. >> Low latency high performance file system is critical component of our >> trading systems. >> >> We have choose ZFS as our primary file system. >> But we saw periodical disk write peaks every 4-5 second. >> >> Please refer first column of below output. (marked in bold) >> Output is generated from our own Disk performance measuring tool. i.e >> DTool (please find attachment) >> >> Compared UFS/VxFS , ZFS is performing very well, but we could not >> minimize periodical peaks. >> We used autoup and tune_r_fsflush flags for UFS tuning. >> >> Are there any ZFS specific tuning, which will reduce file system >> flush interval of ZFS. >> >> I have tried all parameters specified in "solarisinternals" and >> google.com. >> I would like to go for ZFS code change/recompile if necessary. >> >> Please advice.
>> >> Cheers >> Tharindu >> >> >> >> cpu4600-100 /tantan >./*DTool -f M -s 1000 -r 10000 -i 1 -W* >> System Tick = 100 usecs >> Clock resolution 10 >> HR Timer created for 100usecs >> z_FileName = M >> i_Rate = 10000 >> l_BlockSize = 1000 >> i_SyncInterval = 0 >> l_TickInterval = 100 >> i_TicksPerIO = 1 >> i_NumOfIOsPerSlot = 1 >> Max (us)| Min (us) | Avg (us) | MB/S | >> File Freq Distribution >> 336 | 4 | 10.5635 | 4.7688 | M >> 50(98.55), 200(1.09), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> *1911 * | 4 | 10.3152 | 9.4822 | M >> 50(98.90), 200(0.77), 500(0.32), 2000(0.01), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 307 | 4 | 9.9386 | 9.5324 | M >> 50(99.03), 200(0.66), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 331 | 4 | 9.9465 | 9.5332 | M >> 50(99.04), 200(0.72), 500(0.24), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 318 | 4 | 10.1241 | 9.5309 | M >> 50(99.07), 200(0.66), 500(0.27), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 303 | 4 | 9.9236 | 9.5296 | M >> 50(99.13), 200(0.59), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 560 | 4 | 10.2604 | 9.4565 | M >> 50(98.82), 200(0.86), 500(0.31), 2000(0.01), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 376 | 4 | 9.9975 | 9.5176 | M >> 50(99.05), 200(0.63), 500(0.32), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> *9783 * | 4 | 10.8216 | 9.5301 | M >> 50(99.05), 200(0.58), 500(0.36), 2000(0.00), 5000(0.00), 10000(0.01), >> 100000(0.00), 200000(0.00), >> 332 | 4 | 9.9345 | 9.5252 | M >> 50(99.06), 200(0.61), 500(0.33), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 355 | 4 | 9.9906 | 9.5315 | M >> 50(99.01), 200(0.69), 500(0.30), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 356 | 4 | 10.2341 | 9.5207 | M >> 50(98.96), 200(0.76), 500(0.28), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> 320 | 4 | 9.8893 | 9.5279 | M >> 50(99.10), 200(0.59), 500(0.31), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> *10005* | 4 | 10.8956 | 9.5258 | M >> 50(99.07), 200(0.63), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.01), 200000(0.00), >> 308 | 4 | 9.8417 | 9.5312 | M >> 50(99.07), 200(0.64), 500(0.29), 2000(0.00), 5000(0.00), 10000(0.00), >> 100000(0.00), 200000(0.00), >> >> >> ------------------------------------------------------------------------ >> >> ******************************************************************************************************************************************************************* >> >> >> "The information contained in this email including in any attachment >> is confidential and is meant to be read only by the person to whom it >> is addressed. If you are not the intended recipient(s), you are >> prohibited from printing, forwarding, saving or copying this email. >> If you have received this e-mail in error, please immediately notify >> the sender and delete this e-mail and its attachments from your >> computer." 
Hello Tharindu,

Tuesday, July 22, 2008, 5:56:58 PM, you wrote:

> Dear ZFS Gurus,
>
> We are developing low latency transaction processing systems for stock exchanges.
> Low latency high performance file system is critical component of our trading systems.
>
> We have choose ZFS as our primary file system.
> But we saw periodical disk write peaks every 4-5 second.
>
> Please refer first column of below output. (marked in bold)
> Output is generated from our own Disk performance measuring tool. i.e DTool (please find attachment)
>
> Compared UFS/VxFS , ZFS is performing very well, but we could not minimize periodical peaks.
> We used autoup and tune_r_fsflush flags for UFS tuning.
>
> Are there any ZFS specific tuning, which will reduce file system flush interval of ZFS.

You can tune it by changing txg_time via mdb or /etc/system. By default currently is set to 5 seconds.

However as Mark pointed out - first ask yourself why you think it is a problem for you.

--
Best regards,
 Robert Milkowski   mailto:milek@task.gda.pl
                    http://milek.blogspot.com
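For reference, a minimal sketch of the /etc/system form of the change Robert describes; it takes effect at the next boot. The symbol name is an assumption to verify against your kernel: it was txg_time on builds of that era and was renamed zfs_txg_timeout on later ones.

    * /etc/system -- shorten the ZFS transaction group sync interval
    * from the default 5 seconds to 1 second (symbol may be named
    * zfs_txg_timeout on newer builds)
    set zfs:txg_time = 1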
Hello Tharindu,

Wednesday, July 23, 2008, 6:35:33 AM, you wrote:

TRB> Dear Mark/All, TRB> Our trading system is writing to local and/or array volume at 10k TRB> messages per second. TRB> Each message is about 700bytes in size. TRB> Before ZFS, we used UFS. TRB> Even with UFS, there was evey 5 second peak due to fsflush invocation. TRB> However each peak is about ~5ms. TRB> Our application can not recover from such higher latency. TRB> So we used several tuning parameters (tune_r_* and autoup) to decrease TRB> the flush interval. TRB> As a result peaks came down to ~1.5ms. But it is still too high for our TRB> application. TRB> I believe, if we could reduce ZFS sync interval down to ~1s, peaks will TRB> be reduced to ~1ms or less. TRB> We like <1ms peaks per second than 5ms peak per 5 second :-) TRB> Are there any tunable, so i can reduce ZFS sync interval. TRB> If there is no any tunable, can not I use "mdb" for the job ...? TRB> This is not general and we are ok with increased I/O rate. TRB> Please advice/help.

txg_time/D

btw: 10,000 * 700 = ~7MB

What's your storage subsystem? Any, even small, raid device with write cache should help.

-- Best regards, Robert Milkowski mailto:milek at task.gda.pl http://milek.blogspot.com
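Robert's "txg_time/D" is an mdb expression for reading the current value; a rough sketch of using it on a live system, with the same caveat that the symbol may be named zfs_txg_timeout on newer builds, and that the change does not survive a reboot:

    # echo "txg_time/D" | mdb -k        (print the current interval, in decimal seconds)
    # echo "txg_time/W 0t1" | mdb -kw   (set it to 1 second on the running kernel)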
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-code/attachments/20080723/124fec9e/attachment.html>
Tharindu Rukshan Bamunuarachchi
2008-Jul-23 09:03 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080723/bfd2f680/attachment.html>
Tharindu Rukshan Bamunuarachchi
2008-Jul-23 09:05 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080723/0a650762/attachment.html>
Frank.Hofmann at Sun.COM
2008-Jul-23 11:04 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> 10,000 x 700 = 7MB per second ...... > > We have this rate for whole day .... > > 10,000 orders per second is minimum requirments of modern day stock exchanges ... > > Cache still help us for ~1 hours, but after that who will help us ... > > We are using 2540 for current testing ... > I have tried same with 6140, but no significant improvement ... only one or two hours ...

It might not be exactly what you have in mind, but this "how do I get latency down at all costs" thing reminded me of this old paper: http://www.sun.com/blueprints/1000/layout.pdf I'm not a storage architect, someone with more experience in the area care to comment on this ? With huge disks as we have these days, the "wide thin" idea has gone under a bit - but how to replace such setups with modern arrays, if the workload is such that caches eventually must get blown and you're down to spindle speed ?

FrankH.

> Robert Milkowski wrote: > > Hello Tharindu, > > Wednesday, July 23, 2008, 6:35:33 AM, you wrote: > > TRB> Dear Mark/All, > > TRB> Our trading system is writing to local and/or array volume at 10k > TRB> messages per second. > TRB> Each message is about 700bytes in size. > > TRB> Before ZFS, we used UFS. > TRB> Even with UFS, there was evey 5 second peak due to fsflush invocation. > > TRB> However each peak is about ~5ms. > TRB> Our application can not recover from such higher latency. > > TRB> So we used several tuning parameters (tune_r_* and autoup) to decrease > TRB> the flush interval. > TRB> As a result peaks came down to ~1.5ms. But it is still too high for our > TRB> application. > > TRB> I believe, if we could reduce ZFS sync interval down to ~1s, peaks will > TRB> be reduced to ~1ms or less. > TRB> We like <1ms peaks per second than 5ms peak per 5 second :-) > > TRB> Are there any tunable, so i can reduce ZFS sync interval. > TRB> If there is no any tunable, can not I use "mdb" for the job ...? > > TRB> This is not general and we are ok with increased I/O rate. > TRB> Please advice/help. > > txg_time/D > > btw: > 10,000 * 700 = ~7MB > > What's your storage subsystem? Any, even small, raid device with write > cache should help. > > > > > >------------------------------------------------------------------------------ No good can come from selling your freedom, not for all the gold in the world, for the value of this heavenly gift far exceeds that of any fortune on earth. ------------------------------------------------------------------------------
On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> 10,000 x 700 = 7MB per second ...... > > We have this rate for whole day .... > > 10,000 orders per second is minimum requirments of modern day stock exchanges ... > > Cache still help us for ~1 hours, but after that who will help us ... > > We are using 2540 for current testing ... > I have tried same with 6140, but no significant improvement ... only one or two hours ...

Does your application request synchronous file writes or use fsync()? While normally fsync() slows performance, I think that it will also serve to even out the write response since ZFS will not be buffering lots of unwritten data. However, there may be buffered writes from other applications which get written periodically and which may delay the writes from your critical application. In this case reducing the ARC size may help so that the ZFS sync takes less time. You could also run a script which executes 'sync' every second or two in order to convince ZFS to cache less unwritten data. This will cause a bit of a performance hit for the whole system though.

Your 7MB per second is a very tiny write load so it is worthwhile investigating to see if there are other factors which are causing your storage system to not perform correctly. The 2540 is capable of supporting writes at hundreds of MB per second. As an example of "another factor", let's say that you used the 2540 to create 6 small LUNs and then put them into a ZFS zraid. However, in this case the 2540 allocated all of the LUNs from the same disk (which it is happy to do by default) so now that disk is being severely thrashed since it is one disk rather than six.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
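If the application does not already force each message to stable storage, the behaviour Bob is asking about can be made explicit in the writer itself. A rough C sketch of the pattern (the file path and message size are placeholders, not details from this thread):

    /* Append fixed-size messages with per-write durability (O_DSYNC), so
     * each record is committed via the ZIL instead of waiting for the next
     * transaction group sync; calling fsync(fd) after each write is the
     * equivalent alternative. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char msg[700];
        int fd, i;

        (void) memset(msg, 'x', sizeof (msg));
        fd = open("/pool/trading/journal",
            O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
        if (fd < 0) {
            perror("open");
            return (1);
        }
        for (i = 0; i < 10000; i++) {
            if (write(fd, msg, sizeof (msg)) != (ssize_t)sizeof (msg)) {
                perror("write");
                break;
            }
        }
        (void) close(fd);
        return (0);
    }

Bob's alternative of a periodic 'sync' achieves a similar smoothing effect system-wide, at the cost of touching every dirty filesystem rather than just the journal file.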
Frank.Hofmann at Sun.COM wrote:> On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote: > > >> 10,000 x 700 = 7MB per second ...... >> >> We have this rate for whole day .... >> >> 10,000 orders per second is minimum requirments of modern day stock exchanges ... >> >> Cache still help us for ~1 hours, but after that who will help us ... >> >> We are using 2540 for current testing ... >> I have tried same with 6140, but no significant improvement ... only one or two hours ... >> > > It might not be exactly what you have in mind, but this "how do I get > latency down at all costs" thing reminded me of this old paper: > > http://www.sun.com/blueprints/1000/layout.pdf > > I'm not a storage architect, someone with more experience in the area care > to comment on this ? With huge disks as we have these days, the "wide > thin" idea has gone under a bit - but how to replace such setups with > modern arrays, if the workload is such that caches eventually must get > blown and you're down to spindle speed ?

Bob Larson wrote that article, and I would love to ask him for an update. Unfortunately, he passed away a few years ago :-( http://blogs.sun.com/relling/entry/bob_larson_my_friend

I think the model still holds true, the per-disk performance hasn't significantly changed since it was written. This particular problem screams for a queuing model. You don't really need to have a huge cache as long as you can de-stage efficiently. However, the original poster hasn't shared the read workload details... if you never read, it is a trivial problem to solve with a WOM.
-- richard
On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi <tharindub at millenniumit.com> wrote:> > Dear Mark/All, > > Our trading system is writing to local and/or array volume at 10k > messages per second. > Each message is about 700bytes in size. > > Before ZFS, we used UFS. > Even with UFS, there was evey 5 second peak due to fsflush invocation. > > However each peak is about ~5ms. > Our application can not recover from such higher latency.Is the pool using raidz, raidz2, or mirroring? How many drives are you using? -B -- Brandon High bhigh at freaks.com "The good is the enemy of the best." - Nietzsche
Tharindu Rukshan Bamunuarachchi
2008-Jul-24 05:02 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080724/6d88dbfc/attachment.html> -------------- next part -------------- ******************************************************************************************************************************************************************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." *******************************************************************************************************************************************************************
On Wed, Jul 23, 2008 at 10:02 PM, Tharindu Rukshan Bamunuarachchi <tharindub at millenniumit.com> wrote:> We do not use raidz*. Virtually, no raid or stripe through OS.

So it's ZFS on a single LUN exported from the 2540? Or have you created a zpool from multiple raid1 LUNs on the 2540? Have you tried exporting the individual drives and using zfs to handle the mirroring? It might have better performance in your situation. -B -- Brandon High bhigh at freaks.com "The good is the enemy of the best." - Nietzsche
Tharindu Rukshan Bamunuarachchi
2008-Jul-24 11:55 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080724/586b5ff4/attachment.html> -------------- next part -------------- ******************************************************************************************************************************************************************* "The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer." *******************************************************************************************************************************************************************
Hmmn, that *sounds* as if you are saying you've a very-high-redundancy RAID1 mirror, 4 disks deep, on an 'enterprise-class tier 2 storage' array that doesn't support RAID 1+0 or 0+1. That sounds weird: the 2540 supports RAID levels 0, 1, (1+0), 3 and 5, and deep mirrors are normally only used on really fast equipment in mission-critical tier 1 storage... Are you sure you don't mean you have raid 0 (stripes) 4 disks wide, each stripe presented as a LUN?

If you really have 4-deep RAID 1, you have a configuration that will perform somewhat slower than any single disk, as the array launches 4 writes to 4 drives in parallel, and returns success when they all complete. If you had 4-wide RAID 0, with mirroring done at the host, you would have a configuration that would (probabilistically) perform better than a single drive when writing to each side of the mirror, and the write would return success when the slowest side of the mirror completed.

--dave (puzzled!) c-b

Tharindu Rukshan Bamunuarachchi wrote:> We do not use raidz*. Virtually, no raid or stripe through OS. > > We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540. > > 2540 does not have RAID 1+0 or 0+1. > > cheers > tharindu > > Brandon High wrote: > >>On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi >><tharindub at millenniumit.com> wrote: >> >> >>>Dear Mark/All, >>> >>>Our trading system is writing to local and/or array volume at 10k >>>messages per second. >>>Each message is about 700bytes in size. >>> >>>Before ZFS, we used UFS. >>>Even with UFS, there was evey 5 second peak due to fsflush invocation. >>> >>>However each peak is about ~5ms. >>>Our application can not recover from such higher latency. >>> >>> >> >>Is the pool using raidz, raidz2, or mirroring? How many drives are you using? >> >>-B

-- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb at sun.com | -- Mark Twain (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583 bridge: (877) 385-4099 code: 506 9191#
On Thu, 24 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> We do not use raidz*. Virtually, no raid or stripe through OS. > > We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540.

What ZFS block size are you using? Are you using synchronous writes for each 700byte message? 10k synchronous writes per second is pretty high and would depend heavily on the 2540's write cache and how the 2540's firmware behaves.

You will find some cache tweaks for the 2540 in my writeup available at http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf. Without these tweaks, the 2540 waits for the data to be written to disk rather than written to its NVRAM whenever ZFS flushes the write cache.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
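For reference, checking and matching the dataset block size Bob asks about is a one-line change; the pool/dataset name below is hypothetical, and recordsize only applies to files created after the change:

    # zfs get recordsize tank/trading
    # zfs set recordsize=8k tank/trading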
On Thu, 24 Jul 2008, Brandon High wrote:> > Have you tried exporting the individual drives and using zfs to handle > the mirroring? It might have better performance in your situation.

It should indeed have better performance. The single LUN exported from the 2540 will be treated like a single drive from ZFS's perspective. The data written needs to be serialized in the same way that it would be for a drive. ZFS has no understanding that some offsets will access a different drive so it may be that one pair of drives is experiencing all of the load.

The most performant configuration would be to export a LUN from each of the 2540's 12 drives and create a pool of 6 mirrors. In this situation, ZFS will load share across the 6 mirrors so that each pair gets its fair share of the IOPS based on its backlog. The 2540 cache tweaks will also help tremendously for this sort of work load. Since this is for critical data I would not disable the cache mirroring in the 2540's controllers.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
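A sketch of the layout Bob describes: twelve single-disk LUNs exported from the array and mirrored in pairs by ZFS. The device names below are placeholders, not from this thread:

    # zpool create tank \
          mirror c4t0d0 c4t1d0 \
          mirror c4t2d0 c4t3d0 \
          mirror c4t4d0 c4t5d0 \
          mirror c4t6d0 c4t7d0 \
          mirror c4t8d0 c4t9d0 \
          mirror c4t10d0 c4t11d0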
Hello Tharindu,

Wednesday, July 23, 2008, 10:03:15 AM, you wrote:

> 10,000 x 700 = 7MB per second ...... We have this rate for whole day .... 10,000 orders per second is minimum requirments of modern day stock exchanges ... Cache still help us for ~1 hours, but after that who will help us ...

Have you disabled SCSI flushes on the zfs side? Or have you disabled them on the array? On both the 2540 and 6540, if you do not disable them your performance will be very bad, especially for synchronous IOs, as the ZIL will force your array to flush its cache every time. If you are not using ZFS on any other storage than the 2540 on your servers, then put "set zfs:zfs_nocacheflush=1" in /etc/system and do a reboot. If you haven't done so, it should help you considerably.

With such relatively low throughput and with ZFS, plus cache on the array (after the above correction), plus you stated in another email that you are basically not reading at all, you should cache everything in the array and then stream it to disks (partly thanks to CoW in ZFS).

Additional question: how do you write your data? Are you updating larger files or creating a new file each time, or...?

-- Best regards, Robert Milkowski mailto:milek@task.gda.pl http://milek.blogspot.com
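The /etc/system fragment Robert refers to, for completeness; note Bob's caution later in the thread that with flushes disabled, data sitting in the array cache can be lost on a power failure unless the cache is battery-backed and mirrored:

    * /etc/system -- stop ZFS from issuing SCSI cache-flush commands
    * (only reasonable when every pool sits behind battery-backed array cache)
    set zfs:zfs_nocacheflush = 1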
Hello Tharindu, Thursday, July 24, 2008, 6:02:31 AM, you wrote: > We do not use raidz*. Virtually, no raid or stripe through OS. We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540. 2540 does not have RAID 1+0 or 0+1. Of course it does 1+0. Just add more drives to RAID-1 -- Best regards, Robert Milkowski mailto:milek@task.gda.pl http://milek.blogspot.com
On Fri, 25 Jul 2008, Robert Milkowski wrote:> Both on 2540 and 6540 if you do not disable it your performance will > be very bad especially for synchronous IOs as ZIL will force your > array to flush its cache every time. If you are not using ZFS on any > other storage than 2540 on your servers then put "set > zfs:zfs_nocacheflush=1" in /etc/system and do a reboot. If you > haven't done so it should help you considerably.

This does not seem wise since then data (records of trades) may be lost if the system crashes or loses power. It is much better to apply the firmware tweaks so that the 2540 reports that the data is written as soon as it is safely in its NVRAM rather than waiting for it to be on disk. ZFS should then perform rather well with low latency.

However, I have yet to see any response from Tharindu which indicates he has seen any of my emails regarding this (or many emails from others). Based on his responses I would assume that Tharindu is seeing less than a third of the response messages regarding his topic.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes??? --dave Robert Milkowski wrote:> Hello Tharindu, > > > Thursday, July 24, 2008, 6:02:31 AM, you wrote: > > >> > > > > We do not use raidz*. Virtually, no raid or stripe through OS. > > > We have 4 disk RAID1 volumes. RAID1 was created from CAM on 2540. > > > 2540 does not have RAID 1+0 or 0+1. > > > > > Of course it does 1+0. Just add more drives to RAID-1 > > > > > -- > > Best regards, > > Robert Milkowski mailto:milek at task.gda.pl > > http://milek.blogspot.com
-- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb at sun.com | -- Mark Twain (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583 bridge: (877) 385-4099 code: 506 9191#
On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown <davecb at sun.com> wrote:> And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes???Or perhaps 4 RAID1 mirrors concatenated? -B -- Brandon High bhigh at freaks.com "The good is the enemy of the best." - Nietzsche
Brandon High wrote:> On Fri, Jul 25, 2008 at 9:17 AM, David Collier-Brown <davecb at sun.com> wrote: > >>And do you really have 4-sided raid 1 mirrors, not 4-wide raid-0 stripes??? > > > Or perhaps 4 RAID1 mirrors concatenated? >

I wondered that too, but he insists he doesn't have 0+1 or 1+0... Tharindu, could you clarify this for us? It significantly affects what advice we give! --dave (former tech lead, performance engineering at ACE) c-b -- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb at sun.com | -- Mark Twain (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583 bridge: (877) 385-4099 code: 506 9191#
Tharindu Rukshan Bamunuarachchi
2008-Jul-26 17:02 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080726/43bc0268/attachment.html>
On Sat, 26 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> > 1.Re configure array with 12 independent disks > 2. Allocate disks to RAIDZed poolUsing raidz will penalize your transaction performance since all disks will need to perform I/O for each write. It is definitely better to use load-shared mirrors for this purpose.> 3. Fine tune the 2540 according to Bob''s 2540-ZFS-Performance.pdf (Thankx Bob) > 4. Apply ZFS tunings (i.e. zfs_nocacheflush=1 etc.)Hopefully after step #3, step#4 will not be required. Step #4 puts data at risk if there is a system crash.> However, I could not find additional cards to support I/O Multipath. Hope that would not affect > on latency.Probably not. It will effect sequential I/O performance but latency is primarily dependent on disk configuration and ZFS filesystem block size. I have performed some tests here of synchronous writes using iozone with multi-threaded readers/writers. This is for the same 2540 configuration that I wrote about earlier. For this particular test, the ZFS filesystem blocksize is 8K and the size of the I/Os is 8K. This may not be a good representation of your own workload since the threads are contending for I/O with random access. In your case, it seems that writes may be written in a sequential append mode. I also have test results handy for similar test parameters but using various ZFS filesystem settings (8K/128K block size, checksum enable/disable, noatime, and sha256 checksum), and 8K or 128K I/O block sizes. Let me know if there is something you would like for me to measure. It should be easy to simulate your application behavior using iozone. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ Iozone: Performance Test of File I/O Version $Revision: 3.283 $ Compiled for 64 bit mode. Build: Solaris10gcc-64 Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Erik Habbinga, Kris Strecker, Walter Wong. Run began: Wed Jul 2 10:54:19 2008 Multi_buffer. Work area 16777216 bytes OPS Mode. Output is in operations per second. Record Size 8 KB SYNC Mode. File size set to 2097152 KB Command line used: iozone -m -t 8 -T -O -r 8k -o -s 2G Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. 
Throughput test with 8 threads Each thread writes a 2097152 Kbyte file in 8 Kbyte records Children see throughput for 8 initial writers = 4315.57 ops/sec Parent sees throughput for 8 initial writers = 4266.15 ops/sec Min throughput per thread = 532.18 ops/sec Max throughput per thread = 543.36 ops/sec Avg throughput per thread = 539.45 ops/sec Min xfer = 256746.00 ops Children see throughput for 8 rewriters = 2595.08 ops/sec Parent sees throughput for 8 rewriters = 2595.06 ops/sec Min throughput per thread = 322.07 ops/sec Max throughput per thread = 326.15 ops/sec Avg throughput per thread = 324.38 ops/sec Min xfer = 258867.00 ops Children see throughput for 8 readers = 53462.03 ops/sec Parent sees throughput for 8 readers = 53451.08 ops/sec Min throughput per thread = 6340.39 ops/sec Max throughput per thread = 6859.59 ops/sec Avg throughput per thread = 6682.75 ops/sec Min xfer = 242368.00 ops Children see throughput for 8 re-readers = 54585.11 ops/sec Parent sees throughput for 8 re-readers = 54573.08 ops/sec Min throughput per thread = 6022.81 ops/sec Max throughput per thread = 7164.78 ops/sec Avg throughput per thread = 6823.14 ops/sec Min xfer = 220373.00 ops Children see throughput for 8 reverse readers = 56755.70 ops/sec Parent sees throughput for 8 reverse readers = 56667.52 ops/sec Min throughput per thread = 5893.60 ops/sec Max throughput per thread = 7554.16 ops/sec Avg throughput per thread = 7094.46 ops/sec Min xfer = 204744.00 ops Children see throughput for 8 stride readers = 11964.43 ops/sec Parent sees throughput for 8 stride readers = 11959.61 ops/sec Min throughput per thread = 1353.59 ops/sec Max throughput per thread = 1545.83 ops/sec Avg throughput per thread = 1495.55 ops/sec Min xfer = 229619.00 ops Children see throughput for 8 random readers = 3314.17 ops/sec Parent sees throughput for 8 random readers = 3314.11 ops/sec Min throughput per thread = 367.38 ops/sec Max throughput per thread = 482.99 ops/sec Avg throughput per thread = 414.27 ops/sec Min xfer = 199395.00 ops Children see throughput for 8 mixed workload = 2438.01 ops/sec Parent sees throughput for 8 mixed workload = 2414.88 ops/sec Min throughput per thread = 77.17 ops/sec Max throughput per thread = 528.42 ops/sec Avg throughput per thread = 304.75 ops/sec Min xfer = 38284.00 ops Children see throughput for 8 random writers = 3176.50 ops/sec Parent sees throughput for 8 random writers = 3141.77 ops/sec Min throughput per thread = 394.89 ops/sec Max throughput per thread = 400.16 ops/sec Avg throughput per thread = 397.06 ops/sec Min xfer = 258695.00 ops iozone test complete.
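Following up on Bob's offer to simulate the application with iozone, a rough sketch of an invocation that approximates small synchronous writes with ops/sec output; the record and file sizes are assumptions, not figures from the thread:

    iozone -t 1 -T -O -o -r 1k -s 512m -i 0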
Tharindu Rukshan Bamunuarachchi
2008-Jul-26 17:35 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080726/c839a338/attachment.html>
A non-text attachment was scrubbed...
Name: DTool
Type: application/octet-stream
Size: 109888 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080726/c839a338/attachment.obj>
On Sat, 26 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:> It is impossible to simulate my scenario with iozone. iozone performs very well for ZFS. OTOH, > iozone does not measure latency. > > Please find attached tool (Solaris x86), which we have written to measure latency.Very interesting software. I ran it for a little while in a ZFS filesystem configured with 8K zfs blocksize and produced a 673MB file. This is with a full graphical login environment running. I did a short run using 128K zfs blocksize and notice that the average peak latencies are about 2X as high as with 8K blocks, but the maximum peak latencies are similar (i.e. somewhat under 10,000 us). I suspect that the maximum peak latencies have something to do with zfs itself (or something in the test program) rather than the pool configuration. Here is the text output with 8k filesystem blocks: % ./DTool -W -i 1 -s 700 -r 10000 -f file System Tick = 100 usecs Clock resolution 10 HR Timer created for 100usecs z_FileName = file i_Rate = 10000 l_BlockSize = 700 i_SyncInterval = 0 l_TickInterval = 100 i_TicksPerIO = 1 i_NumOfIOsPerSlot = 1 Max (us)| Min (us) | Avg (us) | MB/S | File Freq Distribution 80 | 5 | 6.5637 | 3.3371 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 2116 | 4 | 7.2277 | 6.6429 | file 50(99.88), 200(0.10), 500(0.00), 2000(0.01), 5000(0.01), 10000(0.00), 100000(0.00), 200000(0.00), 60 | 5 | 6.7522 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 64 | 5 | 6.6542 | 6.6753 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 46 | 4 | 6.5489 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 68 | 5 | 6.5236 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8694 | 4 | 8.7859 | 6.4859 | file 50(99.39), 200(0.54), 500(0.03), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 70 | 4 | 6.5669 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 48 | 5 | 6.5907 | 6.6733 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 5 | 6.5948 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 4 | 6.5437 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 7991 | 4 | 8.7452 | 6.5043 | file 50(99.45), 200(0.45), 500(0.06), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 57 | 4 | 6.7606 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 5 | 6.6358 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 46 | 5 | 6.4603 | 6.6726 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 60 | 5 | 6.4511 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9099 | 4 | 9.0321 | 6.4891 | file 50(99.37), 200(0.51), 500(0.07), 2000(0.04), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 48 | 5 | 6.5132 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 72 | 5 | 6.5453 | 
6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 44 | 5 | 6.5788 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 71 | 5 | 6.5554 | 6.6727 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9138 | 4 | 8.9271 | 6.5061 | file 50(99.43), 200(0.48), 500(0.03), 2000(0.04), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 45 | 5 | 6.5028 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 51 | 5 | 6.5297 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.6340 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 4 | 6.6172 | 6.6753 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8671 | 4 | 8.5149 | 6.4436 | file 50(99.58), 200(0.35), 500(0.03), 2000(0.02), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 51 | 5 | 6.5969 | 6.6754 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 523 | 4 | 6.6691 | 6.6753 | file 50(99.98), 200(0.01), 500(0.00), 2000(0.01), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 68 | 5 | 6.6747 | 6.6726 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 41 | 5 | 6.5438 | 6.6734 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8713 | 4 | 8.8272 | 6.4589 | file 50(99.36), 200(0.50), 500(0.11), 2000(0.02), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 50 | 4 | 6.4549 | 6.6754 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 5 | 6.5445 | 6.6726 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 5 | 6.4427 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 71 | 5 | 6.5174 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8745 | 4 | 9.0431 | 6.4960 | file 50(99.34), 200(0.55), 500(0.08), 2000(0.01), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 61 | 5 | 6.6875 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 62 | 5 | 6.5137 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.6640 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 206 | 4 | 6.8049 | 6.6353 | file 50(99.86), 200(0.13), 500(0.01), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9237 | 4 | 8.3959 | 6.5524 | file 50(99.59), 200(0.30), 500(0.07), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 63 | 5 | 6.4380 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 44 | 5 | 6.3695 | 6.6754 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 54 | 5 | 6.5120 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 1166 | 4 | 7.2351 | 6.5967 | file 
50(99.67), 200(0.29), 500(0.03), 2000(0.01), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9103 | 4 | 7.9335 | 6.5807 | file 50(99.83), 200(0.12), 500(0.03), 2000(0.01), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 60 | 5 | 6.5613 | 6.6727 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 66 | 5 | 6.4940 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 51 | 5 | 6.5826 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 1549 | 4 | 7.8373 | 6.4979 | file 50(99.48), 200(0.43), 500(0.06), 2000(0.03), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9221 | 4 | 7.6834 | 6.6720 | file 50(99.96), 200(0.03), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 64 | 4 | 6.5624 | 6.6733 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.5555 | 6.6747 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.5979 | 6.6725 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9219 | 4 | 9.0427 | 6.4955 | file 50(99.45), 200(0.43), 500(0.08), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 71 | 5 | 6.4686 | 6.6725 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.6367 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 49 | 4 | 6.6251 | 6.6752 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.5028 | 6.6727 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8722 | 4 | 8.7847 | 6.5003 | file 50(99.21), 200(0.72), 500(0.02), 2000(0.04), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 46 | 5 | 6.4410 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 45 | 5 | 6.5116 | 6.6726 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 70 | 5 | 6.5629 | 6.6753 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 41 | 5 | 6.4026 | 6.6727 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8728 | 4 | 8.5549 | 6.4809 | file 50(99.47), 200(0.46), 500(0.05), 2000(0.01), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 64 | 5 | 6.6568 | 6.6753 | file 50(99.98), 200(0.02), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 59 | 5 | 6.3614 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 89 | 5 | 6.5293 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 47 | 5 | 6.4494 | 6.6733 | file 50(100.00), 200(0.00), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8752 | 4 | 8.7413 | 6.4929 | file 50(99.51), 200(0.40), 500(0.04), 2000(0.04), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 73 | 5 | 6.5893 | 6.6726 | file 50(99.97), 200(0.03), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 51 | 5 | 6.6581 | 6.6726 | file 50(99.99), 200(0.01), 
500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 62 | 5 | 6.6587 | 6.6726 | file 50(99.99), 200(0.01), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 77 | 5 | 7.2512 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 9169 | 4 | 9.7338 | 6.4627 | file 50(99.04), 200(0.88), 500(0.04), 2000(0.02), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 69 | 5 | 7.2407 | 6.6727 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 97 | 4 | 7.3353 | 6.6753 | file 50(99.93), 200(0.07), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 382 | 4 | 7.2239 | 6.6753 | file 50(99.93), 200(0.06), 500(0.01), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 109 | 4 | 7.2756 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8015 | 4 | 9.7219 | 6.4942 | file 50(98.94), 200(0.92), 500(0.10), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 70 | 5 | 7.2720 | 6.6753 | file 50(99.92), 200(0.08), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 99 | 5 | 7.3499 | 6.6753 | file 50(99.90), 200(0.10), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 64 | 4 | 7.2259 | 6.6753 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 4 | 7.2479 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8778 | 4 | 10.0523 | 6.4904 | file 50(98.99), 200(0.88), 500(0.08), 2000(0.03), 5000(0.01), 10000(0.01), 100000(0.00), 200000(0.00), 72 | 5 | 7.3874 | 6.6753 | file 50(99.90), 200(0.10), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 5 | 7.4387 | 6.6726 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 69 | 5 | 7.4582 | 6.6726 | file 50(99.95), 200(0.05), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 71 | 4 | 7.4296 | 6.6753 | file 50(99.91), 200(0.09), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8756 | 4 | 9.4746 | 6.4835 | file 50(99.08), 200(0.85), 500(0.04), 2000(0.02), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 70 | 5 | 7.4215 | 6.6753 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 66 | 4 | 7.3560 | 6.6753 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 65 | 4 | 7.3263 | 6.6754 | file 50(99.94), 200(0.06), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 87 | 4 | 7.2929 | 6.6753 | file 50(99.95), 200(0.05), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 8753 | 4 | 9.4104 | 6.4477 | file 50(99.16), 200(0.77), 500(0.03), 2000(0.03), 5000(0.00), 10000(0.01), 100000(0.00), 200000(0.00), 71 | 4 | 7.4860 | 6.6754 | file 50(99.90), 200(0.10), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 66 | 5 | 7.2805 | 6.6753 | file 50(99.96), 200(0.04), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 78 | 5 | 7.4491 | 6.6740 | file 50(99.92), 200(0.08), 500(0.00), 2000(0.00), 5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00), 75 | 5 | 7.4095 | 6.6753 | file 50(99.95), 200(0.05), 500(0.00), 2000(0.00), 
5000(0.00), 10000(0.00), 100000(0.00), 200000(0.00),

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sat, 26 Jul 2008, Bob Friesenhahn wrote:

> I suspect that the maximum peak latencies have something to do with
> zfs itself (or something in the test program) rather than the pool
> configuration.

As confirmation that the reported timings have virtually nothing to do
with the pool configuration, I ran the program on a two-drive ZFS mirror
pool consisting of two cheap 500MB USB drives.  The average latency was
not much worse.  The peak latency values are often larger but the
maximum peak is still on the order of 9000 microseconds.

I then ran the test on a single-drive UFS filesystem (300GB 15K RPM SAS
drive) which is freshly created and see that the average latency is
somewhat lower but the maximum peak for each interval is typically much
higher (at least 1200 but often 4000).  I even saw a measured peak as
high as 22224.

Based on the findings, it seems that using the 2540 is a complete waste
if two cheap USB drives in a zfs mirror pool can almost obtain the same
timings.  UFS on the fast SAS drive performed worse.

I did not run your program in a real-time scheduling class (see
priocntl).  Perhaps it would perform better using real-time scheduling.
It might also do better in a fixed-priority class.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
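To make the scheduling-class suggestion concrete, the invocations below are a sketch only: the priority values are arbitrary examples, and ./write_latency_test is a placeholder name, not the test program discussed in this thread.

    # Real-time class at an example priority of 30
    priocntl -e -c RT -p 30 ./write_latency_test <options>

    # Fixed-priority class, so the process is not demoted by the timeshare scheduler
    priocntl -e -c FX -m 60 -p 60 ./write_latency_test <options>

Running a disk benchmark in the RT class can starve other work on the box, so it is best tried on a test system first.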
Bob Friesenhahn wrote:
> On Sat, 26 Jul 2008, Bob Friesenhahn wrote:
>
>> I suspect that the maximum peak latencies have something to do with
>> zfs itself (or something in the test program) rather than the pool
>> configuration.
>
> As confirmation that the reported timings have virtually nothing to do
> with the pool configuration, I ran the program on a two-drive ZFS
> mirror pool consisting of two cheap 500MB USB drives.  The average
> latency was not much worse.  The peak latency values are often larger
> but the maximum peak is still on the order of 9000 microseconds.

Is it doing buffered or sync writes?  I'll try it later today or
tomorrow...

> I then ran the test on a single-drive UFS filesystem (300GB 15K RPM
> SAS drive) which is freshly created and see that the average latency
> is somewhat lower but the maximum peak for each interval is typically
> much higher (at least 1200 but often 4000).  I even saw a measured peak
> as high as 22224.
>
> Based on the findings, it seems that using the 2540 is a complete
> waste if two cheap USB drives in a zfs mirror pool can almost obtain
> the same timings.  UFS on the fast SAS drive performed worse.
>
> I did not run your program in a real-time scheduling class (see
> priocntl).  Perhaps it would perform better using real-time
> scheduling.  It might also do better in a fixed-priority class.

This might be more important.  But a better solution is to assign a
processor set to run only the application -- a good idea any time you
need a predictable response.
 -- richard
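A processor set is created and used roughly as follows; the CPU ids and the set id are illustrative values only, and ./write_latency_test is again a placeholder binary name, so adjust both for the machine at hand.

    # Create a processor set from CPUs 2 and 3 (psrset prints the new set id)
    psrset -c 2 3

    # Optionally keep device interrupts off those CPUs as well
    psradm -i 2 3

    # Run the test bound to the set, assuming the id printed above was 1
    psrset -e 1 ./write_latency_test <options>

The set can be torn down afterwards with psrset -d, and the CPUs returned to normal interrupt handling with psradm -n.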
On Sat, 26 Jul 2008, Richard Elling wrote:
>
> Is it doing buffered or sync writes?  I'll try it later today or
> tomorrow...

I have not seen the source code but truss shows that this program is
doing more than expected such as using send/recv to send a message.  In
fact, send(), pollsys(), recv(), and write() constitute most of the
activity.  A POSIX.4 real-time timer is created.  Perhaps it uses two
threads, with one sending messages to the other over a socketpair, and
the second thread does the actual write.

>> I did not run your program in a real-time scheduling class (see priocntl).
>> Perhaps it would perform better using real-time scheduling.  It might also
>> do better in a fixed-priority class.
>
> This might be more important.  But a better solution is to assign a
> processor set to run only the application -- a good idea any time you
> need a predictable response.

Later on I did try running the program in the real time scheduling
class with high priority and it made no difference at all.  While it is
clear that filesystem type (ZFS or UFS) does make a significant
difference, it seems that the program is doing more than simply timing
the write system call.  A defect in the program could easily account
for the long delays.

It would help if source code for the program can be posted.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
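As a point of comparison for the truss observation above, a harness that times nothing but the write(2) call can be very small.  The sketch below is not the test program from this thread; the file name, block size and iteration count are arbitrary assumptions, and it simply reports the worst-case write(2) latency it saw using the Solaris gethrtime() high-resolution timer.

/*
 * Minimal sketch: time only the write(2) call, nothing else.
 * NOT the program discussed in this thread; file name, block size
 * and iteration count are arbitrary example values.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>   /* gethrtime() on Solaris */

int
main(void)
{
        char buf[1000];
        hrtime_t t0, t1, max = 0;
        int fd, i;

        (void) memset(buf, 'x', sizeof (buf));
        fd = open("latency.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        for (i = 0; i < 100000; i++) {
                t0 = gethrtime();       /* nanoseconds since boot */
                if (write(fd, buf, sizeof (buf)) != sizeof (buf)) {
                        perror("write");
                        return (1);
                }
                t1 = gethrtime();
                if (t1 - t0 > max)
                        max = t1 - t0;
        }
        (void) printf("max write() latency: %lld us\n",
            (long long)(max / 1000));
        (void) close(fd);
        return (0);
}

If a stripped-down harness like this still shows the 4-5 second spikes, the cause is in the filesystem; if not, the overhead is in the messaging layer of the original tool.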
Bob Says:

"But a better solution is to assign a processor set to run only the
application -- a good idea any time you need a predictable response."

Bob's suggestion above along with "no interrupts on that pset", and a
fixed scheduling class for the application/processes in question could
also be helpful.

Tharindu, would you be able to share the source of your
write-latency-measuring application?  This might give us a better idea
of exactly what it's measuring and how.  This might allow people (way
smarter than me) to do some additional/alternative DTRACE work to help
further drill down towards the source-and-resolution of the issue.

Thanks,

-- MikeE

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Richard Elling
Sent: Saturday, July 26, 2008 3:33 PM
To: Bob Friesenhahn
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

Bob Friesenhahn wrote:
> On Sat, 26 Jul 2008, Bob Friesenhahn wrote:
>
>> I suspect that the maximum peak latencies have something to do with
>> zfs itself (or something in the test program) rather than the pool
>> configuration.
>
> As confirmation that the reported timings have virtually nothing to do
> with the pool configuration, I ran the program on a two-drive ZFS
> mirror pool consisting of two cheap 500MB USB drives.  The average
> latency was not much worse.  The peak latency values are often larger
> but the maximum peak is still on the order of 9000 microseconds.

Is it doing buffered or sync writes?  I'll try it later today or
tomorrow...

> I then ran the test on a single-drive UFS filesystem (300GB 15K RPM
> SAS drive) which is freshly created and see that the average latency
> is somewhat lower but the maximum peak for each interval is typically
> much higher (at least 1200 but often 4000).  I even saw a measured peak
> as high as 22224.
>
> Based on the findings, it seems that using the 2540 is a complete
> waste if two cheap USB drives in a zfs mirror pool can almost obtain
> the same timings.  UFS on the fast SAS drive performed worse.
>
> I did not run your program in a real-time scheduling class (see
> priocntl).  Perhaps it would perform better using real-time
> scheduling.  It might also do better in a fixed-priority class.

This might be more important.  But a better solution is to assign a
processor set to run only the application -- a good idea any time you
need a predictable response.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Tharindu Rukshan Bamunuarachchi
2008-Jul-28 12:43 UTC
[zfs-discuss] [zfs-code] Peak every 4-5 second
An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080728/a1200262/attachment.html>
On Mon, 28 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:
>
> I have tried your pdf but did not get good latency numbers even after
> array tuning...

Right.  And since I observed only slightly less optimal performance
from a mirror pair of USB drives, it seems that your requirement is not
challenging at all for the storage hardware.  USB does not offer very
much throughput.  A $200 portable disk drive is able to almost match a
$23K drive array for this application.

I did test your application with output to /dev/null and /tmp (memory
based) and it did report consistently tiny numbers in that case.  It
seems likely that writes to ZFS encounter a "hiccup" every so often in
the write system call.  There is still the possibility that a bug in
your application is causing the "hiccup" in the write timings.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Hello Bob,

Friday, July 25, 2008, 4:58:54 PM, you wrote:

BF> On Fri, 25 Jul 2008, Robert Milkowski wrote:
>> Both on 2540 and 6540, if you do not disable it your performance will
>> be very bad, especially for synchronous IOs, as the ZIL will force your
>> array to flush its cache every time.  If you are not using ZFS on any
>> other storage than 2540 on your servers then put "set
>> zfs:zfs_nocacheflush=1" in /etc/system and do a reboot.  If you
>> haven't done so, it should help you considerably.

BF> This does not seem wise since then data (records of trades) may be
BF> lost if the system crashes or loses power.  It is much better to apply
BF> the firmware tweaks so that the 2540 reports that the data is written
BF> as soon as it is safely in its NVRAM rather than waiting for it to be
BF> on disk.  ZFS should then perform rather well with low latency.

Both cases are basically the same.  Please notice I'm not talking about
disabling the ZIL, I'm talking about disabling cache flushes in ZFS.
ZFS will still wait for the array to confirm that it did receive the
data (into NVRAM).  If you lose power the behavior will be the same --
no difference here.

-- 
Best regards,
 Robert Milkowski                           mailto:milek at task.gda.pl
                                            http://milek.blogspot.com
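For reference, the setting Robert describes is normally applied in one of two ways.  The /etc/system line is the one quoted above; the mdb form was, to the best of my knowledge, the commonly documented way of that era to flip the value on a running kernel, so verify it against your own release before relying on it.

    # /etc/system -- takes effect at the next reboot, as described above
    set zfs:zfs_nocacheflush=1

    # Change the value on the live kernel (0t1 = decimal 1); Solaris/OpenSolaris of that vintage
    echo zfs_nocacheflush/W0t1 | mdb -kw

Either way, the caveat from the thread applies: this is only safe when every pool sits on storage with protected (battery-backed/NVRAM) write cache.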
On Wed, 30 Jul 2008, Robert Milkowski wrote:
>
> Both cases are basically the same.  Please notice I'm not talking about
> disabling the ZIL, I'm talking about disabling cache flushes in ZFS.
> ZFS will still wait for the array to confirm that it did receive the
> data (into NVRAM).

So it seems that in your opinion, the periodic "burp" in system call
completion time is due to ZFS's periodic cache flush.  That is
certainly quite possible.  Testing will prove it, but the testing can
be on someone else's system rather than my own. :)

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
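One way to do that testing (a sketch, not something posted in this thread) is to quantize write(2) latency for the test process while counting transaction-group syncs in the same window, so the periodic spikes can be lined up against sync activity.  The process name below is a placeholder, the fbt probe depends on the kernel build in use, and the script must run as root.

    dtrace -n '
    syscall::write:entry /execname == "write_latency_test"/ { self->ts = timestamp; }
    syscall::write:return /self->ts/ {
            @lat["write(2) latency (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
    }
    fbt::spa_sync:entry { @sync["spa_sync() calls in interval"] = count(); }
    tick-5s { printa(@lat); printa(@sync); clear(@lat); clear(@sync); }'

If the tail of the write(2) distribution only fattens in the intervals where spa_sync() fires, the periodic sync/cache-flush explanation fits; if not, the delay is coming from somewhere else.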
Hello Bob,

Wednesday, July 30, 2008, 3:07:05 AM, you wrote:

BF> On Wed, 30 Jul 2008, Robert Milkowski wrote:
>> Both cases are basically the same.  Please notice I'm not talking about
>> disabling the ZIL, I'm talking about disabling cache flushes in ZFS.
>> ZFS will still wait for the array to confirm that it did receive the
>> data (into NVRAM).

BF> So it seems that in your opinion, the periodic "burp" in system call
BF> completion time is due to ZFS's periodic cache flush.  That is
BF> certainly quite possible.

Could be.  Additionally, he will end up with up to 35 IOs queued per
LUN, and if he doesn't effectively have an NVRAM cache there, the
latency can dramatically increase during these periods.

-- 
Best regards,
 Robert Milkowski                           mailto:milek at task.gda.pl
                                            http://milek.blogspot.com
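The figure of 35 lines up with the default of the zfs_vdev_max_pending tunable in ZFS of that era, which caps how many I/Os ZFS keeps outstanding per vdev.  If deep per-LUN queues turn out to be behind the latency spikes, the cap was commonly lowered via /etc/system; the value below is an illustrative example, not a recommendation for this workload.

    # Example only: reduce the per-vdev I/O queue depth (the default was 35 at the time)
    set zfs:zfs_vdev_max_pending=10

Lowering it trades some throughput for shorter device queues, which is usually the right direction for a latency-sensitive workload, but the effect should be measured rather than assumed.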