Hi.
Server is v240 with 2x 1.5GHz CPUs and 4GB of RAM. ZFS is based on b32.
There are 8x 3511 SATA JBODs with 500GB disks (96 disks in total)
connected using two loops (one dual-port FC card) and MPxIO. One large pool is
created (stripe) as:
zpool create large c4t22E4000A3300FC31d0
zpool add large c4t22E4000A3300FC32d0
zpool add large c4t22E4000A3300FC33d0
zpool add large c4t22E4000A3300FC34d0
zpool add large c4t22E4000A3300FC35d0
zpool add large c4t22E4000A3300FC36d0
zpool add large c4t22E4000A3300FC37d0
zpool add large c4t22E4000A3300FC38d0
zpool add large c4t22E4000A3300FC39d0
zpool add large c4t22E4000A3300FC3Ad0
zpool add large c4t22E4000A3300FC3Bd0
zpool add large c4t22E4000A3300FC3Cd0
zpool add large c4t22E4000A3300FC0Dd0
zpool add large c4t22E4000A3300FC0Ed0
zpool add large c4t22E4000A3300FC0Fd0
[...]
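(For what it's worth, listing all the disks in a single zpool create gives the
same layout - one top-level vdev per disk, dynamically striped, no redundancy -
e.g.:

zpool create large c4t22E4000A3300FC31d0 c4t22E4000A3300FC32d0 [...] c4t22E4000A3300FC0Fd0
)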
Now I'm doing:
bash-3.00# dd if=/dev/zero of=/large/q1 bs=128k
^C36553+0 records in
36553+0 records out
bash-3.00#
At the same time zpool iostat was running:
bash-3.00# zpool iostat 1
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
large 4.78M 43.5T 0 100 48.5K 843K
large 4.78M 43.5T 0 0 0 0
large 4.78M 43.5T 0 1.20K 0 153M
large 629M 43.5T 0 1.10K 0 138M
large 630M 43.5T 0 16 0 139K
large 630M 43.5T 0 495 0 61.9M
large 630M 43.5T 0 1.74K 0 222M
large 630M 43.5T 0 1.71K 0 220M
large 630M 43.5T 0 1.74K 0 222M
large 1.45G 43.5T 0 863 0 104M
large 1.45G 43.5T 0 1.16K 0 147M
large 1.45G 43.5T 0 1.79K 0 229M
large 1.45G 43.5T 0 1.86K 0 238M
large 2.22G 43.5T 0 364 0 42.1M
large 2.22G 43.5T 0 549 0 67.8M
large 2.22G 43.5T 0 1.73K 0 221M
large 2.22G 43.5T 0 1.72K 0 221M
large 2.22G 43.5T 0 1.80K 0 230M
large 3.09G 43.5T 0 425 0 49.0M
large 3.09G 43.5T 0 1.40K 0 179M
large 3.09G 43.5T 0 1.85K 0 237M
large 3.09G 43.5T 0 1.92K 0 245M
large 3.09G 43.5T 0 1.89K 0 242M
large 3.09G 43.5T 0 1.91K 0 244M
large 3.09G 43.5T 0 1.05K 0 134M
large 3.09G 43.5T 0 22 0 1.49M
large 4.38G 43.5T 8 767 559K 92.1M
large 4.47G 43.5T 0 43 0 676K
large 4.47G 43.5T 0 0 0 0
^C
bash-3.00#
As you can see, write throughput drops to as little as ~1MB/s for a moment.
Sometimes it's that low even for 2-4 seconds.
Reading back the file created this way is fine - I get a steady stream of about
160MB/s (single dd), or 240MB/s when zfetch_block_cap is increased to 256.
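(In case anyone wants to try the same tuning: zfetch_block_cap is a global in
the zfs kernel module, so it can be poked on a live system with mdb, roughly as
below. Treat it as a sketch and check the variable's size first - /Z writes
8 bytes, /W writes 4.)

echo 'zfetch_block_cap/Z 0t256' | mdb -kw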
I don't see any errors on the FC links:
bash-3.00# fcinfo hba-port -l
HBA Port WWN: 210000e08b82cf2e
OS Device Name: /dev/cfg/c2
Manufacturer: QLogic Corp.
Model: 375-3108-xx
Type: L-port
State: online
Supported Speeds: 1Gb 2Gb
Current Speed: 2Gb
Node WWN: 200000e08b82cf2e
Link Error Statistics:
Link Failure Count: 0
Loss of Sync Count: 0
Loss of Signal Count: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 0
Invalid CRC Count: 0
HBA Port WWN: 210100e08ba2cf2e
OS Device Name: /dev/cfg/c3
Manufacturer: QLogic Corp.
Model: 375-3108-xx
Type: L-port
State: online
Supported Speeds: 1Gb 2Gb
Current Speed: 2Gb
Node WWN: 200100e08ba2cf2e
Link Error Statistics:
Link Failure Count: 0
Loss of Sync Count: 0
Loss of Signal Count: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 0
Invalid CRC Count: 0
bash-3.00#
Using iostat I can see that there's a queue building up on some disks, for example:
    0.0   21.0    0.0  2689.4  0.0  17.1    0.0   812.5   0 100 c4t22E4000A3300FC3Bd0
    0.0    4.0    0.0   512.3 18.0   1.0 4488.3   249.9 100 100 c4t22E4000A3300FB5Dd0
    0.0   21.0    0.0  2689.4  0.0  21.7    0.0  1030.4   0 100 c4t22E4000A3300FB7Dd0
But the disks showing this behaviour change from second to second, so I
don't believe there's a problem with the disks themselves.
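(Those are iostat -xn style columns: r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t,
asvc_t, %w, %b, device. Something like the command below - the egrep pattern is
just an example to limit it to this pool's disks - gives that view once per second:)

iostat -xnz 1 | egrep 'device|c4t22E4'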
Changing the block size in the dd command does not change this behaviour.
P.S. SDR-0169 was created to cover this.
On 3/23/06, Robert Milkowski <milek at task.gda.pl> wrote:
> Hi.
>
> Server is v240 with 2x 1.5GHz CPUs and 4GB of RAM. ZFS is based on b32.
> There are 8x 3511 SATA JBODs with 500GB disks (96 disks in total) connected
> using two loops (one dual-port FC card) and MPxIO. One large pool is
> created (stripe) as:
>
> [...]
>
> As you can see, write throughput drops to as little as ~1MB/s for a moment.
> Sometimes it's that low even for 2-4 seconds.
>
> Reading back the file created this way is fine - I get a steady stream of
> about 160MB/s (single dd), or 240MB/s when zfetch_block_cap is increased
> to 256.

I've seen this quite often. My thought is this could just be data buffering.
When data is buffered in memory, the actual disk writes go down; when memory
fills up to a certain threshold, the data is flushed out again. A read causes
a pagein immediately if the data is not already in memory.

Sounds obvious, but I can't think of an easy way to prove/disprove it. It may
very well be something else, but I wouldn't worry too much unless the
_average_ write throughput is lower than expected.

The interesting question would be what the thresholds/algorithm are and what
can be tuned. What if I don't want to buffer the data, or only buffer it for
small/random I/Os - something like the "release-behind mechanism" option in AIX:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.doc/aixbman/prftungd/fsperf3.htm

Tao
Hello Tao,

Thursday, March 23, 2006, 10:08:42 PM, you wrote:

> My thought is this could just be data buffering. When data is buffered in
> memory, the actual disk writes go down; when memory fills up to a certain
> threshold, the data is flushed out again.
> [...]

I don't know - writing from /dev/zero should be fast enough to keep the
write cache saturated all the time.

--
Best regards,
 Robert                          mailto:rmilkowski@task.gda.pl
                                 http://milek.blogspot.com
bash-3.00# dtrace -n 'syscall::write:entry /execname == "dd"/ {self->t = timestamp;}' \
    -n 'syscall::write:return /self->t/ {@ = quantize(timestamp - self->t); self->t = 0;}' \
    -n 'tick-1s {printa(@);}'
0 43451 :tick-1s
value ------------- Distribution ------------- count
32768 | 0
65536 | 1149
131072 |@@@@@@@@@@@@@@@@ 45126
262144 |@@ 6957
524288 |@@@@@@@@@@@@@@@@@@@@@@ 62123
1048576 | 1
2097152 | 0
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 1
4294967296 | 10
8589934592 | 0
^C
Every time I saw a decrease in write throughput there was a long write()
(those are the 11 entries above 2s).
So I wrote a small DTrace script (listed at the end of this message) which
catches write requests that lasted more than 4s and, within those, any fbt:zfs
probe that took longer than 1s. Here's the result:
bash-3.00# ./long_write-3.d
txg_wait_open 4088517163
zfs`zfs_write+0x308
genunix`write+0x268
genunix`dtrace_systrace_syscall32+0xa8
unix`syscall_trap32+0xcc
libc.so.1`_write+0x8
dd`flsh+0x70
dd`main+0x11ac
dd`_start+0x108
WRITE-return LONG 4090212496
txg_wait_open 4160249413
zfs`zfs_write+0x308
genunix`write+0x268
genunix`dtrace_systrace_syscall32+0xa8
unix`syscall_trap32+0xcc
libc.so.1`_write+0x8
dd`flsh+0x70
dd`main+0x11ac
dd`_start+0x108
WRITE-return LONG 4162075079
txg_wait_open 4085121246
zfs`zfs_write+0x308
genunix`write+0x268
genunix`dtrace_systrace_syscall32+0xa8
unix`syscall_trap32+0xcc
libc.so.1`_write+0x8
dd`flsh+0x70
dd`main+0x11ac
dd`_start+0x108
WRITE-return LONG 4086859663
^C
bash-3.00#
So dd is hanging in the write() syscall for more than 4s, and write() spends
most of that time in txg_wait_open().
The vtimestamp (on-CPU) times for txg_wait_open() are many orders of magnitude
lower, so the thread is sleeping there rather than burning CPU.
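For anyone who wants to double-check that comparison, an untested one-liner
along these lines should show wall-clock vs. on-CPU time for txg_wait_open()
side by side (both distributions print on ^C; it assumes the fbt probes resolve
the same way they do in long_write-3.d below):

dtrace -n 'fbt:zfs:txg_wait_open:entry { self->t = timestamp; self->vt = vtimestamp; }' \
       -n 'fbt:zfs:txg_wait_open:return /self->t/ { @["wall (ns)"] = quantize(timestamp - self->t); @["cpu (ns)"] = quantize(vtimestamp - self->vt); self->t = 0; self->vt = 0; }'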
The code for txg_wait_open() (from txg.c) is:
void
txg_wait_open(dsl_pool_t *dp, uint64_t txg)
{
        tx_state_t *tx = &dp->dp_tx;

        mutex_enter(&tx->tx_sync_lock);
        ASSERT(tx->tx_threads == 3);
        if (txg == 0)
                txg = tx->tx_open_txg + 1;
        if (tx->tx_quiesce_txg_waiting < txg)
                tx->tx_quiesce_txg_waiting = txg;
        dprintf("txg=%llu quiesce_txg=%llu sync_txg=%llu\n",
            txg, tx->tx_quiesce_txg_waiting, tx->tx_sync_txg_waiting);
        while (tx->tx_open_txg < txg) {
                cv_broadcast(&tx->tx_quiesce_more_cv);
                cv_wait(&tx->tx_quiesce_done_cv, &tx->tx_sync_lock);
        }
        mutex_exit(&tx->tx_sync_lock);
}
So write() ends up blocked in cv_wait() until the next transaction group opens -
presumably the quiesce/sync side can't keep up with the rate at which dd is
dirtying data. Any ideas?
bash-3.00# cat long_write-3.d
#!/usr/sbin/dtrace -qs

#pragma D option nspec=100

/* time write() calls issued by dd */
syscall::write:entry
/ execname == "dd" /
{
        self->t = timestamp;
        self->bs = arg2;
}

/* report writes that took longer than 4s */
syscall::write:return
/ self->t && (timestamp - self->t) > 4000000000 /
{
        printf("WRITE-return LONG\t%d\n\n", timestamp - self->t);
        self->t = 0;
}

/* writes that completed sooner: just clear the state */
syscall::write:return
/ self->t /
{
        self->t = 0;
        self->spec = 0;
}

/* timestamp every zfs function entered while inside a traced write() */
fbt:zfs::entry
/ self->t /
{
        self->ft = timestamp;
}

/* report a zfs function returning more than 1s after the last recorded entry */
fbt:zfs::return
/ self->ft && (timestamp - self->ft) > 1000000000 /
{
        printf("%-20s %d\n", probefunc, timestamp - self->ft);
        stack();
        ustack();
        self->ft = 0;
}
bash-3.00#
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com