Hi.

Server is v240 with 2x 1.5GHz CPUs and 4GB of RAM. ZFS is based on b32.
There are 8x 3511 SATA JBODs with 500GB disks (96 disks in total),
connected using two loops (one dual-port FC card) and MPxIO. One large
pool is created (stripe) as:

zpool create large c4t22E4000A3300FC31d0
zpool add large c4t22E4000A3300FC32d0
zpool add large c4t22E4000A3300FC33d0
zpool add large c4t22E4000A3300FC34d0
zpool add large c4t22E4000A3300FC35d0
zpool add large c4t22E4000A3300FC36d0
zpool add large c4t22E4000A3300FC37d0
zpool add large c4t22E4000A3300FC38d0
zpool add large c4t22E4000A3300FC39d0
zpool add large c4t22E4000A3300FC3Ad0
zpool add large c4t22E4000A3300FC3Bd0
zpool add large c4t22E4000A3300FC3Cd0
zpool add large c4t22E4000A3300FC0Dd0
zpool add large c4t22E4000A3300FC0Ed0
zpool add large c4t22E4000A3300FC0Fd0
[...]

Now I'm doing:

bash-3.00# dd if=/dev/zero of=/large/q1 bs=128k
^C36553+0 records in
36553+0 records out
bash-3.00#

At the same time zpool iostat was running:

bash-3.00# zpool iostat 1
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
large       4.78M  43.5T      0    100  48.5K   843K
large       4.78M  43.5T      0      0      0      0
large       4.78M  43.5T      0  1.20K      0   153M
large        629M  43.5T      0  1.10K      0   138M
large        630M  43.5T      0     16      0   139K
large        630M  43.5T      0    495      0  61.9M
large        630M  43.5T      0  1.74K      0   222M
large        630M  43.5T      0  1.71K      0   220M
large        630M  43.5T      0  1.74K      0   222M
large       1.45G  43.5T      0    863      0   104M
large       1.45G  43.5T      0  1.16K      0   147M
large       1.45G  43.5T      0  1.79K      0   229M
large       1.45G  43.5T      0  1.86K      0   238M
large       2.22G  43.5T      0    364      0  42.1M
large       2.22G  43.5T      0    549      0  67.8M
large       2.22G  43.5T      0  1.73K      0   221M
large       2.22G  43.5T      0  1.72K      0   221M
large       2.22G  43.5T      0  1.80K      0   230M
large       3.09G  43.5T      0    425      0  49.0M
large       3.09G  43.5T      0  1.40K      0   179M
large       3.09G  43.5T      0  1.85K      0   237M
large       3.09G  43.5T      0  1.92K      0   245M
large       3.09G  43.5T      0  1.89K      0   242M
large       3.09G  43.5T      0  1.91K      0   244M
large       3.09G  43.5T      0  1.05K      0   134M
large       3.09G  43.5T      0     22      0  1.49M
large       4.38G  43.5T      8    767   559K  92.1M
large       4.47G  43.5T      0     43      0   676K
large       4.47G  43.5T      0      0      0      0
^C
bash-3.00#

As you can see, write throughput drops to as low as 1MB/s for a moment.
Sometimes it stays that low for 2-4 seconds.

Reading back a file created this way is fine - I get a steady stream of
about 160MB/s (single dd), or 240MB/s when zfetch_block_cap is increased
to 256.

I don't see any errors on the FC links:

bash-3.00# fcinfo hba-port -l
HBA Port WWN: 210000e08b82cf2e
        OS Device Name: /dev/cfg/c2
        Manufacturer: QLogic Corp.
        Model: 375-3108-xx
        Type: L-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 200000e08b82cf2e
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
HBA Port WWN: 210100e08ba2cf2e
        OS Device Name: /dev/cfg/c3
        Manufacturer: QLogic Corp.
        Model: 375-3108-xx
        Type: L-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 200100e08ba2cf2e
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
bash-3.00#

Using iostat I can see that there is a queue to some disks, for example:

  r/s  w/s  kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0 21.0   0.0 2689.4  0.0 17.1    0.0  812.5   0 100 c4t22E4000A3300FC3Bd0
  0.0  4.0   0.0  512.3 18.0  1.0 4488.3  249.9 100 100 c4t22E4000A3300FB5Dd0
  0.0 21.0   0.0 2689.4  0.0 21.7    0.0 1030.4   0 100 c4t22E4000A3300FB7Dd0

But the disks showing this behaviour change from second to second, so I
don't believe there is a problem with the disks themselves. Changing the
block size in the dd command does not change the behaviour.

ps. SDR-0169 was created to cover this.
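To check whether particular disks are the ones stalling (rather than the
stripe as a whole), per-device write latency could be quantized with the
DTrace io provider while the dd runs. A minimal sketch, using the
standard io-provider pattern and assuming the usual bufinfo/devinfo
fields (b_flags, b_edev, b_blkno, dev_statname) are available on this
build:

#!/usr/sbin/dtrace -qs

/*
 * Quantize write latency per device.  Writes are selected as
 * "not B_READ" since B_WRITE is defined as 0 in buf.h.
 */

io:::start
/!(args[0]->b_flags & B_READ)/
{
	start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
	@[args[1]->dev_statname] =
	    quantize(timestamp - start[args[0]->b_edev, args[0]->b_blkno]);
	start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

tick-10s
{
	printa(@);
	trunc(@);
}

If the same few devices keep showing multi-second outliers it would
point at the disks or the loop; if the slow devices rotate, the host
side looks more likely.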
On 3/23/06, Robert Milkowski <milek at task.gda.pl> wrote:
>
> Hi.
>
> Server is v240 with 2x 1.5GHz CPUs and 4GB of RAM. ZFS is based on b32.
> There are 8x 3511 SATA JBODs with 500GB disks (96 disks in total),
> connected using two loops (one dual-port FC card) and MPxIO. One large
> pool is created (stripe) as:
>
> [...]
>
> As you can see, write throughput drops to as low as 1MB/s for a moment.
> Sometimes it stays that low for 2-4 seconds.
>
> Reading back a file created this way is fine - I get a steady stream of
> about 160MB/s (single dd), or 240MB/s when zfetch_block_cap is increased
> to 256.
>

I've seen this quite often. My thought is that this could just be data
buffering: while data is being buffered in memory, the actual disk writes
go down; once memory fills up to a certain threshold, the data is flushed
out again. A read, on the other hand, causes a page-in immediately if the
data is not already in memory.

Sounds obvious, but I can't think of an easy way to prove/disprove it.
It may very well be something else, but I wouldn't worry too much unless
the _average_ write throughput is lower than expected.

The interesting question is what the thresholds/algorithm are and what
can be tuned. What if I don't want to buffer the data at all, or only
buffer small/random I/Os - something like the "release-behind" mechanism
in AIX:

http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.doc/aixbman/prftungd/fsperf3.htm#j12a04n2001relbeh
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.doc/aixbman/prftungd/fsperf3.htm

Tao
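One rough way to test the buffering theory would be to compare, second by
second, the bytes dd hands to write(2) with the bytes actually issued to
the disks. A DTrace sketch of that idea (untested here; it counts
requested rather than completed bytes on the application side, and the
disk side includes all writes on the box, metadata included):

#!/usr/sbin/dtrace -qs

/*
 * Per-second application-level vs disk-level write rates.
 */

syscall::write:entry
/execname == "dd"/
{
	@app["dd write(2) MB/s"] = sum(arg2);
}

io:::start
/!(args[0]->b_flags & B_READ)/
{
	@disk["disk writes MB/s"] = sum(args[0]->b_bcount);
}

tick-1s
{
	normalize(@app, 1048576);
	normalize(@disk, 1048576);
	printa(@app);
	printa(@disk);
	trunc(@app);
	trunc(@disk);
}

If buffering alone explained the dips, the application-side rate should
stay roughly flat while the disk-side rate swings between near zero and
a burst well above it.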
Hello Tao,

Thursday, March 23, 2006, 10:08:42 PM, you wrote:

> I've seen this quite often. My thought is that this could just be data
> buffering: while data is being buffered in memory, the actual disk
> writes go down; once memory fills up to a certain threshold, the data
> is flushed out again.
> [...]
> It may very well be something else, but I wouldn't worry too much
> unless the _average_ write throughput is lower than expected.

I don't know - writing from /dev/zero should be fast enough to keep the
write cache saturated all the time.

-- 
Best regards,
 Robert                          mailto:rmilkowski@task.gda.pl
                                 http://milek.blogspot.com
bash-3.00# dtrace -n syscall::write:entry'/execname == "dd"/{self->t=timestamp;}' \
           -n syscall::write:return'/self->t/{@=quantize(timestamp-self->t);self->t=0;}' \
           -n tick-1s'{printa(@);}'

  0  43451                        :tick-1s
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |                                         1149
          131072 |@@@@@@@@@@@@@@@@                         45126
          262144 |@@                                       6957
          524288 |@@@@@@@@@@@@@@@@@@@@@@                   62123
         1048576 |                                         1
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         0
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |                                         0
      2147483648 |                                         1
      4294967296 |                                         10
      8589934592 |                                         0
^C

Every time I saw a drop in write throughput there was a long write (those
11 above 2s). So I wrote a small DTrace script which catches write() calls
lasting more than 4s, and any fbt:zfs function taking longer than 1s while
dd is in write(). Here's the result:

bash-3.00# ./long_write-3.d
txg_wait_open        4088517163

              zfs`zfs_write+0x308
              genunix`write+0x268
              genunix`dtrace_systrace_syscall32+0xa8
              unix`syscall_trap32+0xcc

              libc.so.1`_write+0x8
              dd`flsh+0x70
              dd`main+0x11ac
              dd`_start+0x108

WRITE-return LONG	4090212496

txg_wait_open        4160249413

              zfs`zfs_write+0x308
              genunix`write+0x268
              genunix`dtrace_systrace_syscall32+0xa8
              unix`syscall_trap32+0xcc

              libc.so.1`_write+0x8
              dd`flsh+0x70
              dd`main+0x11ac
              dd`_start+0x108

WRITE-return LONG	4162075079

txg_wait_open        4085121246

              zfs`zfs_write+0x308
              genunix`write+0x268
              genunix`dtrace_systrace_syscall32+0xa8
              unix`syscall_trap32+0xcc

              libc.so.1`_write+0x8
              dd`flsh+0x70
              dd`main+0x11ac
              dd`_start+0x108

WRITE-return LONG	4086859663
^C
bash-3.00#

So dd is hanging in the write() syscall for more than 4s, and write()
spends most of that time in txg_wait_open(). The on-CPU times for
txg_wait_open are many orders of magnitude lower, so it is sleeping
rather than computing. The code for the function is:

void
txg_wait_open(dsl_pool_t *dp, uint64_t txg)
{
	tx_state_t *tx = &dp->dp_tx;

	mutex_enter(&tx->tx_sync_lock);
	ASSERT(tx->tx_threads == 3);
	if (txg == 0)
		txg = tx->tx_open_txg + 1;
	if (tx->tx_quiesce_txg_waiting < txg)
		tx->tx_quiesce_txg_waiting = txg;
	dprintf("txg=%llu quiesce_txg=%llu sync_txg=%llu\n",
	    txg, tx->tx_quiesce_txg_waiting, tx->tx_sync_txg_waiting);
	while (tx->tx_open_txg < txg) {
		cv_broadcast(&tx->tx_quiesce_more_cv);
		cv_wait(&tx->tx_quiesce_done_cv, &tx->tx_sync_lock);
	}
	mutex_exit(&tx->tx_sync_lock);
}

Any ideas?

bash-3.00# cat long_write-3.d
#!/usr/sbin/dtrace -qs

#pragma D option nspec=100

syscall::write:entry
/ execname == "dd" /
{
	self->t = timestamp;
	self->bs = arg2;
}

syscall::write:return
/ self->t && (timestamp - self->t) > 4000000000 /
{
	printf("WRITE-return LONG\t%d\n\n", timestamp - self->t);
	self->t = 0;
}

syscall::write:return
/ self->t /
{
	self->t = 0;
	self->spec = 0;
}

fbt:zfs::entry
/ self->t /
{
	self->ft = timestamp;
}

fbt:zfs::return
/ self->ft && (timestamp - self->ft) > 1000000000 /
{
	printf("%-20s %d\n", probefunc, timestamp - self->ft);
	stack();
	ustack();
	self->ft = 0;
}
bash-3.00#

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
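If the writer really is waiting for the transaction group to rotate, it
might help to measure how long each sync pass takes and compare it with
the time spent in txg_wait_open. A rough sketch along the lines of
long_write-3.d (the fbt function names, spa_sync in particular, are an
assumption about where this b32 kernel does the sync work):

#!/usr/sbin/dtrace -qs

/*
 * Print how long each txg sync pass takes (spa_sync, assumed) and
 * quantize how long writers block in txg_wait_open.
 */

fbt:zfs:spa_sync:entry
{
	self->sync = timestamp;
}

fbt:zfs:spa_sync:return
/self->sync/
{
	printf("%Y  spa_sync took %d ms\n", walltimestamp,
	    (timestamp - self->sync) / 1000000);
	self->sync = 0;
}

fbt:zfs:txg_wait_open:entry
{
	self->wait = timestamp;
}

fbt:zfs:txg_wait_open:return
/self->wait/
{
	@["txg_wait_open (ns)"] = quantize(timestamp - self->wait);
	self->wait = 0;
}

tick-10s
{
	printa(@);
	trunc(@);
}

If the spa_sync times roughly match the multi-second txg_wait_open waits,
the writer is simply being held back until the previous transaction group
has been pushed out to disk.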