Hi.
Server is v240 with 2x 1.5GHz CPUs and 4GB of RAM. ZFS is based on b32.
There are 8x 3511 SATA JBODs with 500GB disks (96 disks in total)
connected using two loops (one dual-port FC card) and MPxIO. One large pool is
created (stripe) as:
zpool create large c4t22E4000A3300FC31d0
zpool add large c4t22E4000A3300FC32d0
zpool add large c4t22E4000A3300FC33d0
zpool add large c4t22E4000A3300FC34d0
zpool add large c4t22E4000A3300FC35d0
zpool add large c4t22E4000A3300FC36d0
zpool add large c4t22E4000A3300FC37d0
zpool add large c4t22E4000A3300FC38d0
zpool add large c4t22E4000A3300FC39d0
zpool add large c4t22E4000A3300FC3Ad0
zpool add large c4t22E4000A3300FC3Bd0
zpool add large c4t22E4000A3300FC3Cd0
zpool add large c4t22E4000A3300FC0Dd0
zpool add large c4t22E4000A3300FC0Ed0
zpool add large c4t22E4000A3300FC0Fd0
[...]
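(For what it's worth, listing all the disks in a single zpool create gives the
same layout - one top-level vdev per disk, dynamically striped, no redundancy -
e.g.:

zpool create large c4t22E4000A3300FC31d0 c4t22E4000A3300FC32d0 [...] c4t22E4000A3300FC0Fd0
)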
Now I'm doing:
bash-3.00# dd if=/dev/zero of=/large/q1 bs=128k
^C36553+0 records in
36553+0 records out
bash-3.00#
At the same time zpool iostat was running:
bash-3.00# zpool iostat 1
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
large 4.78M 43.5T 0 100 48.5K 843K
large 4.78M 43.5T 0 0 0 0
large 4.78M 43.5T 0 1.20K 0 153M
large 629M 43.5T 0 1.10K 0 138M
large 630M 43.5T 0 16 0 139K
large 630M 43.5T 0 495 0 61.9M
large 630M 43.5T 0 1.74K 0 222M
large 630M 43.5T 0 1.71K 0 220M
large 630M 43.5T 0 1.74K 0 222M
large 1.45G 43.5T 0 863 0 104M
large 1.45G 43.5T 0 1.16K 0 147M
large 1.45G 43.5T 0 1.79K 0 229M
large 1.45G 43.5T 0 1.86K 0 238M
large 2.22G 43.5T 0 364 0 42.1M
large 2.22G 43.5T 0 549 0 67.8M
large 2.22G 43.5T 0 1.73K 0 221M
large 2.22G 43.5T 0 1.72K 0 221M
large 2.22G 43.5T 0 1.80K 0 230M
large 3.09G 43.5T 0 425 0 49.0M
large 3.09G 43.5T 0 1.40K 0 179M
large 3.09G 43.5T 0 1.85K 0 237M
large 3.09G 43.5T 0 1.92K 0 245M
large 3.09G 43.5T 0 1.89K 0 242M
large 3.09G 43.5T 0 1.91K 0 244M
large 3.09G 43.5T 0 1.05K 0 134M
large 3.09G 43.5T 0 22 0 1.49M
large 4.38G 43.5T 8 767 559K 92.1M
large 4.47G 43.5T 0 43 0 676K
large 4.47G 43.5T 0 0 0 0
^C
bash-3.00#
As you can see, write throughput drops to as little as ~1MB/s for a moment.
Sometimes it's that low even for 2-4 seconds.
Reading back the file created this way is fine - I get a steady stream of about
160MB/s (single dd), or 240MB/s when zfetch_block_cap is increased to 256.
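(In case anyone wants to try the same tuning: zfetch_block_cap is a global in
the zfs kernel module, so it can be poked on a live system with mdb, roughly as
below. Treat it as a sketch and check the variable's size first - /Z writes
8 bytes, /W writes 4.)

echo 'zfetch_block_cap/Z 0t256' | mdb -kw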
I don't see any errors on the FC links:
bash-3.00# fcinfo hba-port -l
HBA Port WWN: 210000e08b82cf2e
OS Device Name: /dev/cfg/c2
Manufacturer: QLogic Corp.
Model: 375-3108-xx
Type: L-port
State: online
Supported Speeds: 1Gb 2Gb
Current Speed: 2Gb
Node WWN: 200000e08b82cf2e
Link Error Statistics:
Link Failure Count: 0
Loss of Sync Count: 0
Loss of Signal Count: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 0
Invalid CRC Count: 0
HBA Port WWN: 210100e08ba2cf2e
OS Device Name: /dev/cfg/c3
Manufacturer: QLogic Corp.
Model: 375-3108-xx
Type: L-port
State: online
Supported Speeds: 1Gb 2Gb
Current Speed: 2Gb
Node WWN: 200100e08ba2cf2e
Link Error Statistics:
Link Failure Count: 0
Loss of Sync Count: 0
Loss of Signal Count: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 0
Invalid CRC Count: 0
bash-3.00#
Using iostat I can see that there's a queue building up on some disks, for example:
    0.0   21.0    0.0  2689.4  0.0  17.1    0.0   812.5   0 100 c4t22E4000A3300FC3Bd0
    0.0    4.0    0.0   512.3 18.0   1.0 4488.3   249.9 100 100 c4t22E4000A3300FB5Dd0
    0.0   21.0    0.0  2689.4  0.0  21.7    0.0  1030.4   0 100 c4t22E4000A3300FB7Dd0
But the disks showing this behaviour change from second to second, so I
don't believe there's a problem with the disks themselves.
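(Those are iostat -xn style columns: r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t,
asvc_t, %w, %b, device. Something like the command below - the egrep pattern is
just an example to limit it to this pool's disks - gives that view once per second:)

iostat -xnz 1 | egrep 'device|c4t22E4'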
Changing the block size in the dd command does not change this behaviour.
P.S. SDR-0169 was created to cover this.
On 3/23/06, Robert Milkowski <milek at task.gda.pl> wrote:
> Hi.
>
> Server is v240 with 2x 1.5GHz CPUs and 4GB of RAM. ZFS is based on b32.
> There are 8x 3511 SATA JBODs with 500GB disks (96 disks in total) connected
> using two loops (one dual-port FC card) and MPxIO. One large pool is
> created (stripe) as:
>
> [...]
>
> As you can see, write throughput drops to as little as ~1MB/s for a moment.
> Sometimes it's that low even for 2-4 seconds.
>
> Reading back the file created this way is fine - I get a steady stream of
> about 160MB/s (single dd), or 240MB/s when zfetch_block_cap is increased
> to 256.

I've seen this quite often. My thought is this could just be data buffering.
When data is buffered in memory, the actual disk writes go down; when memory
fills up to a certain threshold, the data is flushed out again. A read causes
a pagein immediately if the data is not already in memory.

Sounds obvious, but I can't think of an easy way to prove/disprove it. It may
very well be something else, but I wouldn't worry too much unless the
_average_ write throughput is lower than expected.

The interesting question would be what the thresholds/algorithm are and what
can be tuned. What if I don't want to buffer the data, or only buffer it for
small/random I/Os - something like the "release-behind mechanism" option in AIX:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.doc/aixbman/prftungd/fsperf3.htm

Tao
Hello Tao,

Thursday, March 23, 2006, 10:08:42 PM, you wrote:

> My thought is this could just be data buffering. When data is buffered in
> memory, the actual disk writes go down; when memory fills up to a certain
> threshold, the data is flushed out again.
> [...]

I don't know - writing from /dev/zero should be fast enough to keep the
write cache saturated all the time.

--
Best regards,
 Robert                          mailto:rmilkowski@task.gda.pl
                                 http://milek.blogspot.com
bash-3.00# dtrace -n 'syscall::write:entry /execname == "dd"/ {self->t = timestamp;}' \
    -n 'syscall::write:return /self->t/ {@ = quantize(timestamp - self->t); self->t = 0;}' \
    -n 'tick-1s {printa(@);}'
0 43451 :tick-1s
value ------------- Distribution ------------- count
32768 | 0
65536 | 1149
131072 |@@@@@@@@@@@@@@@@ 45126
262144 |@@ 6957
524288 |@@@@@@@@@@@@@@@@@@@@@@ 62123
1048576 | 1
2097152 | 0
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 1
4294967296 | 10
8589934592 | 0
^C
Every time I saw a decrease in write throughput there was a long write()
(those are the 11 entries above 2s).
So I wrote a small DTrace script (listed at the end of this message) which
catches write requests that lasted more than 4s and, within those, any fbt:zfs
probe that took longer than 1s. Here's the result:
bash-3.00# ./long_write-3.d
txg_wait_open 4088517163
zfs`zfs_write+0x308
genunix`write+0x268
genunix`dtrace_systrace_syscall32+0xa8
unix`syscall_trap32+0xcc
libc.so.1`_write+0x8
dd`flsh+0x70
dd`main+0x11ac
dd`_start+0x108
WRITE-return LONG 4090212496
txg_wait_open 4160249413
zfs`zfs_write+0x308
genunix`write+0x268
genunix`dtrace_systrace_syscall32+0xa8
unix`syscall_trap32+0xcc
libc.so.1`_write+0x8
dd`flsh+0x70
dd`main+0x11ac
dd`_start+0x108
WRITE-return LONG 4162075079
txg_wait_open 4085121246
zfs`zfs_write+0x308
genunix`write+0x268
genunix`dtrace_systrace_syscall32+0xa8
unix`syscall_trap32+0xcc
libc.so.1`_write+0x8
dd`flsh+0x70
dd`main+0x11ac
dd`_start+0x108
WRITE-return LONG 4086859663
^C
bash-3.00#
So dd is hanging in the write() syscall for more than 4s, and write() spends
most of that time in txg_wait_open().
The vtimestamp (on-CPU) times for txg_wait_open() are many orders of magnitude
lower, so the thread is sleeping there rather than burning CPU.
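For anyone who wants to double-check that comparison, an untested one-liner
along these lines should show wall-clock vs. on-CPU time for txg_wait_open()
side by side (both distributions print on ^C; it assumes the fbt probes resolve
the same way they do in long_write-3.d below):

dtrace -n 'fbt:zfs:txg_wait_open:entry { self->t = timestamp; self->vt = vtimestamp; }' \
       -n 'fbt:zfs:txg_wait_open:return /self->t/ { @["wall (ns)"] = quantize(timestamp - self->t); @["cpu (ns)"] = quantize(vtimestamp - self->vt); self->t = 0; self->vt = 0; }'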
The code for txg_wait_open() (from txg.c) is:
void
txg_wait_open(dsl_pool_t *dp, uint64_t txg)
{
        tx_state_t *tx = &dp->dp_tx;

        mutex_enter(&tx->tx_sync_lock);
        ASSERT(tx->tx_threads == 3);
        if (txg == 0)
                txg = tx->tx_open_txg + 1;
        if (tx->tx_quiesce_txg_waiting < txg)
                tx->tx_quiesce_txg_waiting = txg;
        dprintf("txg=%llu quiesce_txg=%llu sync_txg=%llu\n",
            txg, tx->tx_quiesce_txg_waiting, tx->tx_sync_txg_waiting);
        while (tx->tx_open_txg < txg) {
                cv_broadcast(&tx->tx_quiesce_more_cv);
                cv_wait(&tx->tx_quiesce_done_cv, &tx->tx_sync_lock);
        }
        mutex_exit(&tx->tx_sync_lock);
}
So write() ends up blocked in cv_wait() until the next transaction group opens -
presumably the quiesce/sync side can't keep up with the rate at which dd is
dirtying data. Any ideas?
bash-3.00# cat long_write-3.d
#!/usr/sbin/dtrace -qs

#pragma D option nspec=100

/* time write() calls issued by dd */
syscall::write:entry
/ execname == "dd" /
{
        self->t = timestamp;
        self->bs = arg2;
}

/* report writes that took longer than 4s */
syscall::write:return
/ self->t && (timestamp - self->t) > 4000000000 /
{
        printf("WRITE-return LONG\t%d\n\n", timestamp - self->t);
        self->t = 0;
}

/* writes that completed sooner: just clear the state */
syscall::write:return
/ self->t /
{
        self->t = 0;
        self->spec = 0;
}

/* timestamp every zfs function entered while inside a traced write() */
fbt:zfs::entry
/ self->t /
{
        self->ft = timestamp;
}

/* report a zfs function returning more than 1s after the last recorded entry */
fbt:zfs::return
/ self->ft && (timestamp - self->ft) > 1000000000 /
{
        printf("%-20s %d\n", probefunc, timestamp - self->ft);
        stack();
        ustack();
        self->ft = 0;
}
bash-3.00#
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com