Sandon Van Ness
2010-May-30 22:04 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Basically for a few seconds at a time I can get very nice speeds through rsync (saturating a 1 gig link), around 112-113 megabytes/sec, which is about as good as I can expect after overhead. The problem is that every 5 seconds, when data is actually written to the disks (physically watching the disk LEDs), I see the activity on the machine sending the data stall while the ZFS machine shows disk activity (writes).

This problem brings the average write speed down from around 112-113 megabytes/sec to around 100 megabytes/sec (sometimes lower), so it lowers speeds by 10% or a bit more. I only really care because 10% can make a difference when you are copying terabytes of data to the machine. Anyway, here is what I am seeing on the Linux machine that is sending to the ZFS machine:

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3      17.00    0.00  496.00  0.00  112.00   0.00    462.45      0.42   0.84   0.16   7.90

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           6.34   0.00     3.80     0.00    0.00  89.86

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       1.00    0.00  584.00  0.00  111.75   0.00    391.88      0.43   0.74   0.12   7.10

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           6.84   0.00     5.44     0.00    0.00  87.72

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       4.00    0.00  557.00  0.00  112.00   0.00    411.81      0.43   0.77   0.13   7.10

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.64   0.00     2.51     0.00    0.00  92.86

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       1.98    0.00  104.95  0.00   23.76   0.00    463.70      0.09   0.84   0.17   1.78

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7.11   0.00     5.08     0.00    0.00  87.82

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       9.00    0.00  538.00  0.00  112.69   0.00    428.97      0.39   0.72   0.14   7.50

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7.46   0.00     5.56     0.00    0.00  86.98

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       8.00    0.00  524.00  0.00  112.00   0.00    437.74      0.38   0.72   0.13   6.90

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7.98   0.00     5.96     0.00    0.00  86.06

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       1.00    0.00  493.00  0.00  111.29   0.00    462.31      0.39   0.80   0.16   7.90

This is iostat -xm 1 sdc3. Basically you can see it is reading at around full gig speed and then the reads drop off because writes are stalled on the ZFS machine. These are 1-second averages.

On the ZFS end, with 10-second averages, I see approximately 100 MB/sec:

data        25.1G  18.1T      0    834      0   100M
data        26.2G  18.1T      0    833      0   100M
data        27.3G  18.1T      0    833      0   100M

Change this to 1 second and I see:

data        32.7G  18.1T      0      0      0      0
data        32.7G  18.1T      0  2.86K      0   360M
data        33.3G  18.1T      0    264      0  21.8M
data        33.3G  18.1T      0      0      0      0
data        33.3G  18.1T      0      0      0      0
data        33.3G  18.1T      0      0      0      0
data        33.3G  18.1T      0  2.94K      0   369M
data        33.8G  18.1T      0    375      0  35.1M
data        33.8G  18.1T      0      0      0      0
data        33.8G  18.1T      0      0      0      0
data        33.8G  18.1T      0      0      0      0
data        33.8G  18.1T      0  2.90K      0   365M
data        34.4G  18.1T      0    599      0  62.6M
data        34.4G  18.1T      0      0      0      0
data        34.4G  18.1T      0      0      0      0
data        34.4G  18.1T      0      0      0      0
data        34.4G  18.1T      0  2.10K      0   265M
data        34.9G  18.1T      0  1.77K      0   211M

I tried changing the txg sync time from 30 to 1, and that did make things smoother but in general lowered speeds (down to 90 megabytes/sec or so).
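For reference, the pool-side numbers above are zpool iostat samples; a sketch of the two invocations that would produce them, assuming the pool is named data as the output shows:

    zpool iostat data 10     # 10-second averages: a steady ~100 MB/s
    zpool iostat data 1      # 1-second samples: idle gaps followed by ~360 MB/s txg bursts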
Actually writing files to the array I see well in excess of 112 megabytes/sec, so I would think I should be able to get this to go at full gig speed without the small stalls:

root at opensolaris: 11:36 AM :/data# dd bs=1M count=100000 if=/dev/zero of=./100gb.bin
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 233.257 s, 450 MB/s
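One caveat on that dd figure: /dev/zero is perfectly compressible, so if compression were enabled on the dataset the 450 MB/s would be optimistic. A minimal sketch of a variant using incompressible input (paths and sizes are illustrative, not from the thread):

    # stage ~2 GB of random data on the root pool, then time a copy into the data pool
    dd if=/dev/urandom of=/var/tmp/random.bin bs=1M count=2000
    dd if=/var/tmp/random.bin of=/data/random-copy.bin bs=1M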
Richard Elling
2010-May-30 23:22 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On May 30, 2010, at 3:04 PM, Sandon Van Ness wrote:
> Basically for a few seconds at a time I can get very nice speeds through
> rsync (saturating a 1 gig link), around 112-113 megabytes/sec, which is
> about as good as I can expect after overhead. The problem is that every
> 5 seconds, when data is actually written to the disks (physically
> watching the disk LEDs), I see the activity on the machine sending the
> data stall while the ZFS machine shows disk activity (writes).
>
> <snip - quoted iostat and zpool iostat output from the original message>
>
> I tried changing the txg sync time from 30 to 1, and that did make things
> smoother but in general lowered speeds (down to 90 megabytes/sec or so).
> Actually writing files to the array I see well in excess of 112
> megabytes/sec, so I would think I should be able to get this to go at
> full gig speed without the small stalls:

I have better luck tuning the zfs_txg_synctime_ms from 5000 to 1000 or less.

> root at opensolaris: 11:36 AM :/data# dd bs=1M count=100000 if=/dev/zero
> of=./100gb.bin
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 233.257 s, 450 MB/s

If you want to decouple the txg commit completely, then you might consider
using a buffer of some sort. I use mbuffer for pipes, but that may be tricky
to use in an rsync environment.
 -- richard

-- 
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
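For anyone wanting to try Richard's suggestion, a minimal sketch of the two usual ways to change that tunable; the 1000 ms value is just his example, and the symbol name and size should be verified against the running build before poking it with mdb:

    # runtime change (assumes zfs_txg_synctime_ms is a 32-bit int, as in the
    # OpenSolaris source of that era); 0t makes the value decimal
    echo zfs_txg_synctime_ms/W0t1000 | mdb -kw

    # persistent change in /etc/system (takes effect at the next boot)
    set zfs:zfs_txg_synctime_ms = 1000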
Sandon Van Ness
2010-May-31 04:06 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/30/2010 04:22 PM, Richard Elling wrote:
> If you want to decouple the txg commit completely, then you might consider
> using a buffer of some sort. I use mbuffer for pipes, but that may be tricky
> to use in an rsync environment.
> -- richard

I initially thought this was I/O, but now I think it might be a CPU bottleneck causing the problem. When I write zeros using dd I can get quite good speed (~500 megabytes/sec), but it totally maxes out both of my CPU cores when I do so (2-5% idle in top). The problem is that when it does the write burst, it takes CPU away from rsync, so what is causing the dip during writes may not be the I/O activity itself but the CPU load generated by the writes.

I actually verified this by making a non-raidz2 pool, just adding all the disks into a pool with nothing else specified:

zpool create data c4t5000C500028BD5FCd0p0 c4t5000C50009A4D727d0p0 \
    c4t5000C50009A46AF5d0p0 c4t5000C50009A515B0d0p0 c4t5000C500028A81BEd0p0 \
    c4t5000C500028B44A1d0p0 c4t5000C500028B415Bd0p0 c4t5000C500028B23D2d0p0 \
    c4t5000C5000CC3338Dd0p0 c4t5000C500027F59C8d0p0 c4t5000C50009DBF8D4d0p0 \
    c4t5000C500027F3C1Fd0p0 c4t5000C5000DAF02F3d0p0 c4t5000C5000DA7ED4Ed0p0 \
    c4t5000C5000DAEF990d0p0 c4t5000C5000DAEEF8Ed0p0 c4t5000C5000DAEB881d0p0 \
    c4t5000C5000A121581d0p0 c4t5000C5000DAC848Fd0p0 c4t5000C50002770EE6d0p0

Once I did this I got a nice, stable 115 megabytes/sec over the network, so 11.5% better. So the problem appears to be that when it goes to write data it uses 100% of its CPU (even if only for some number of ms at a time), which stalls the network while it does this. It happens when it does parity calculations but not with a non-parity zpool. I don't think I can throttle the parity calculations at all, so I don't think there will be a fix for this, unfortunately =(. I can live with losing 10% off my rsync speed, though.
Bob Friesenhahn
2010-May-31 20:13 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Sun, 30 May 2010, Sandon Van Ness wrote:
>
> The problem is that when it does the write burst, it takes CPU away from
> rsync, so what is causing the dip during writes may not be the I/O
> activity itself but the CPU load generated by the writes.

You didn't say which Solaris you are using, but I think that there are recent updates to Solaris 10 and OpenSolaris which are supposed to solve this problem (zfs blocking access to CPU by applications).

From Solaris 10 x86 (kernel 142901-09):

6586537 async zio taskqs can block out userland commands

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sandon Van Ness
2010-May-31 20:32 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 01:13 PM, Bob Friesenhahn wrote:
> On Sun, 30 May 2010, Sandon Van Ness wrote:
>>
>> The problem is that when it does the write burst, it takes CPU away from
>> rsync, so what is causing the dip during writes may not be the I/O
>> activity itself but the CPU load generated by the writes.
>
> You didn't say which Solaris you are using, but I think that there are
> recent updates to Solaris 10 and OpenSolaris which are supposed to
> solve this problem (zfs blocking access to CPU by applications).
>
> From Solaris 10 x86 (kernel 142901-09):
>
> 6586537 async zio taskqs can block out userland commands
>
> Bob

I am using OpenSolaris snv_134:

root at opensolaris: 01:32 PM :~# uname -a
SunOS opensolaris 5.11 snv_134 i86pc i386 i86pc

Is there a setting to change the CPU scheduling for the ZFS process, or is it supposed to just be fixed out of the box? I figure if the parity calculation process could be 'reniced' or something, this problem could likely be fixed, since it ended up being CPU and not disk I/O like I originally thought.
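A quick way to check which scheduling class the competing threads actually run in (a diagnostic sketch; the process names and the rsync pid come from the top output later in the thread, and the priocntl line is only an illustration of moving a process into a fixed-priority class, not something recommended in the thread):

    # show the scheduling class and priority of rsync and the pool's I/O threads
    ps -e -o pid,class,pri,args | egrep 'rsync|zpool-data'

    # illustration only: put the rsync pid (1326 in the later top output) into FX at priority 60
    priocntl -s -c FX -m 60 -p 60 -i pid 1326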
Bob Friesenhahn
2010-May-31 20:51 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, 31 May 2010, Sandon Van Ness wrote:
>> 6586537 async zio taskqs can block out userland commands
>>
>> Bob
>
> I am using OpenSolaris snv_134:
>
> root at opensolaris: 01:32 PM :~# uname -a
> SunOS opensolaris 5.11 snv_134 i86pc i386 i86pc
>
> Is there a setting to change the CPU scheduling for the ZFS process, or
> is it supposed to just be fixed out of the box? I figure if the parity
> calculation process could be 'reniced' or something, this problem could
> likely be fixed, since it ended up being CPU and not disk I/O like I
> originally thought.

There are multiple factors at work. Your OpenSolaris should be new enough to have the fix in which the zfs I/O tasks are run in a scheduling class at lower priority than normal user processes. However, there is also a throttling mechanism for processes which produce data faster than the disks can consume it. This throttling mechanism depends on the amount of RAM available to zfs and the write speed of the I/O channel. More available RAM results in more write buffering, which results in a larger chunk of data written at the next transaction group write interval. The maximum size of a transaction group may be configured in /etc/system similar to:

* Set ZFS maximum TXG group size to 2684354560
set zfs:zfs_write_limit_override = 0xa0000000

If the transaction group is smaller, then zfs will need to write more often. Processes will still be throttled, but the duration of the delay should be smaller because there is less data to write in each burst. I think that (with multiple writers) the zfs pool will be "healthier" and less fragmented if you can offer zfs more RAM and accept some stalls during writing. There are always tradeoffs.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
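For reference, a sketch of inspecting and changing that write limit at runtime instead of rebooting with a new /etc/system; zfs_write_limit_override is a 64-bit variable in the OpenSolaris source of that era, hence the 64-bit mdb formats, but verify the symbol on your build first:

    # read the current override (0 means no override is in effect)
    echo zfs_write_limit_override/E | mdb -k

    # set it to 0xa0000000 (2684354560 bytes) without a reboot
    echo zfs_write_limit_override/Z 0xa0000000 | mdb -kw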
Sandon Van Ness
2010-May-31 21:32 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
> There are multiple factors at work. Your OpenSolaris should be new
> enough to have the fix in which the zfs I/O tasks are run in a
> scheduling class at lower priority than normal user processes.
> <snip>
> I think that (with multiple writers) the zfs pool will be "healthier"
> and less fragmented if you can offer zfs more RAM and accept some
> stalls during writing. There are always tradeoffs.
>
> Bob

Well, it seems like when messing with the txg sync times and similar settings it did make the transfer smoother, but it didn't actually help with speeds; it just meant the hangs happened for a shorter time at a smaller interval, and actually lowering the time between writes just seemed to make things slightly worse.

I think I have come to the conclusion that the problem here is CPU, because it only does this with parity raid. If it were I/O-based I would expect it to be the same, if anything heavier on I/O with non-parity raid, since that configuration is no longer CPU-bottlenecked (a dd write test gives me near 700 megabytes/sec vs 450 with raidz2).

So if I am understanding things, the issue I am seeing should already be fixed, but apparently it is not (in my case), as CPU usage from the parity/zfs calculations is taking precedence over the process doing the writing (rsync)?

I think I have nearly 100% come to the conclusion that the issue is CPU-based, because I saw the same dips even when using mbuffer.
Mike Gerdts
2010-May-31 21:52 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness <sandon at van-ness.com> wrote:
> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
>> There are multiple factors at work. Your OpenSolaris should be new
>> enough to have the fix in which the zfs I/O tasks are run in a
>> scheduling class at lower priority than normal user processes.
>> <snip>
>
> Well, it seems like when messing with the txg sync times and similar
> settings it did make the transfer smoother, but it didn't actually help
> with speeds; it just meant the hangs happened for a shorter time at a
> smaller interval, and actually lowering the time between writes just
> seemed to make things slightly worse.
>
> I think I have come to the conclusion that the problem here is CPU,
> because it only does this with parity raid. If it were I/O-based I would
> expect it to be the same, if anything heavier on I/O with non-parity
> raid, since that configuration is no longer CPU-bottlenecked (a dd write
> test gives me near 700 megabytes/sec vs 450 with raidz2).

To see if the CPU is pegged, take a look at the output of:

mpstat 1
prstat -mLc 1

If mpstat shows that the idle time reaches 0, or the process' latency column is more than a few tenths of a percent, you are probably short on CPU.

It could also be that interrupts are stealing cycles from rsync. Placing it in a processor set with interrupts disabled in that processor set may help.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
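A sketch of what that processor-set suggestion could look like on this two-core box; the set and CPU numbers are illustrative, and on a two-core machine this leaves only the other core for everything else, so it is very much a trade-off:

    # create a processor set containing CPU 1 (prints the new set id, assumed to be 1 below)
    psrset -c 1

    # stop the CPUs in set 1 from handling interrupts
    psrset -f 1

    # bind the running rsync (pid 1326 in the later top output) to the set
    psrset -b 1 1326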
Sandon Van Ness
2010-May-31 21:53 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 02:32 PM, Sandon Van Ness wrote:
> Well, it seems like when messing with the txg sync times and similar
> settings it did make the transfer smoother, but it didn't actually help
> with speeds; it just meant the hangs happened for a shorter time at a
> smaller interval, and actually lowering the time between writes just
> seemed to make things slightly worse.
> <snip>
> I think I have nearly 100% come to the conclusion that the issue is
> CPU-based, because I saw the same dips even when using mbuffer.

And here is some top output. The slowdowns occur when zpool-data starts using CPU and rsync gets CPU-starved.

Normal activity shows:

last pid: 22635;  load avg: 2.17, 2.18, 2.16;  up 0+18:04:42          14:53:29
59 processes: 57 sleeping, 1 running, 1 on cpu
CPU states: 54.7% idle, 23.4% user, 21.9% kernel, 0.0% iowait, 0.0% swap
Kernel: 37646 ctxsw, 193 trap, 20914 intr, 45295 syscall
Memory: 4027M phys mem, 190M free mem, 2013M total swap, 2013M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  1326 root        1  59  -20  383M   44M run    496:47 28.87% rsync
  1322 root        1  59  -20  383M  357M sleep   11:21  0.70% rsync
     3 root        1  60  -20    0K    0K sleep    1:24  0.06% fsflush

When starved:

last pid: 22636;  load avg: 2.16, 2.18, 2.16;  up 0+18:05:16          14:54:03
59 processes: 57 sleeping, 2 on cpu
CPU states: 24.9% idle, 10.5% user, 64.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 17855 ctxsw, 18 trap, 12831 intr, 21090 syscall
Memory: 4027M phys mem, 198M free mem, 2013M total swap, 2013M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
   604 root       39  99  -20    0K    0K cpu/0  316:55 53.36% zpool-data
  1326 root        1  59  -20  383M   44M sleep  497:03 13.49% rsync
  1322 root        1  59  -20  383M  357M sleep   11:21  0.33% rsync
 22635 root        1  59    0 3852K 1912K cpu/1    0:00  0.06% top
     3 root        1  60  -20    0K    0K sleep    1:24  0.06% fsflush

The stall actually lasts less than a second, but the Solaris version of top doesn't seem to accept values below 1 (other than 0) with -s the way you can on Linux (-d .5); otherwise I think zpool-data would show near 100% CPU during the stall.
Sandon Van Ness
2010-May-31 21:58 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 02:52 PM, Mike Gerdts wrote:
> On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness <sandon at van-ness.com> wrote:
>> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
>>> <snip - earlier discussion of the write throttle and zfs_write_limit_override>
>>
>> I think I have come to the conclusion that the problem here is CPU,
>> because it only does this with parity raid. If it were I/O-based I would
>> expect it to be the same, if anything heavier on I/O with non-parity
>> raid, since that configuration is no longer CPU-bottlenecked (a dd write
>> test gives me near 700 megabytes/sec vs 450 with raidz2).
>
> To see if the CPU is pegged, take a look at the output of:
>
> mpstat 1
> prstat -mLc 1
>
> If mpstat shows that the idle time reaches 0, or the process' latency
> column is more than a few tenths of a percent, you are probably short
> on CPU.
>
> It could also be that interrupts are stealing cycles from rsync.
> Placing it in a processor set with interrupts disabled in that
> processor set may help.

Unfortunately neither of these utilities makes it possible to get values for less than 1 second, which is how long the hang is (it happens for about half a second).
Here is with mpstat:

CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 20229 20029 23175   25  222  972    0   322    0  31   0  69
  1    0   0    0    43     9 14926   10  221  909    0 44606   47  14   0  40
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 20230 20029 23618   27  210  914    0   492    0  31   0  69
  1    0   0    0    50     6 14614   12  215  875    0 45930   47  14   0  39
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 11551 11335 12715   39  154  702    0   385    0  62   0  38
  1    0   0    0  2457  2379  8545   35  150  739    0 24349   25  53   0  22
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   14 12804 12591 14333  243  204  610    0   794    1  20   0  79
  1    0   0    4   977   863  9561   11  210  559    0 24590   27  10   0  63
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 20239 20035 22932   46  224  996    0   537    0  31   0  69
  1    0   0    0    66     7 15498   12  227  930    0 44808   46  14   0  39

Here is what I get with prstat:

Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1326 root      47  11 0.0 0.0 0.0 0.0  40 1.6  6K  29 45K   0 rsync/1
  1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.4  6K   0   0   0 rsync/1
 22644 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0  16   9 255   0 prstat/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   3  31   0 sshd/1
   446 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   1   0 fmd/24
   196 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   1   0 nscd/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/15
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/14
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/12
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/11
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/10
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/9
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/8
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   604 root     0.0  33 0.0 0.0 0.0 0.0  42  25  18  13   0   0 zpool-data/13
   604 root     0.0  30 0.0 0.0 0.0 0.0  41  29  12  12   0   0 zpool-data/15
  1326 root      12 2.9 0.0 0.0 0.0 0.0  85 0.4  1K  12 11K   0 rsync/1
   604 root     0.0  15 0.0 0.0 0.0 0.0  41  44 111   9   0   0 zpool-data/27
   604 root     0.0  14 0.0 0.0 0.0 0.0  43  42  72   3   0   0 zpool-data/33
   604 root     0.0 5.9 0.0 0.0 0.0 0.0  41  53 109   6   0   0 zpool-data/19
   604 root     0.0 5.4 0.0 0.0 0.0 0.0  42  53 106   8   0   0 zpool-data/25
   604 root     0.0 5.3 0.0 0.0 0.0 0.0  43  51 107   7   0   0 zpool-data/21
   604 root     0.0 4.5 0.0 0.0 0.0 0.0  41  54 110   4   0   0 zpool-data/31
   604 root     0.0 3.9 0.0 0.0 0.0 0.0  41  55 109   3   0   0 zpool-data/23
   604 root     0.0 3.7 0.0 0.0 0.0 0.0  44  52 111   2   0   0 zpool-data/29
  1322 root     0.0 0.4 0.0 0.0 0.0 0.0  98 2.0  1K   0   1   0 rsync/1
 22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16  13 255   0 prstat/1
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   5   3  69   0 sshd/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   2 105   0 nscd/17
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1326 root      44  11 0.0 0.0 0.0 0.0  44 1.3  5K  24 42K   0 rsync/1
  1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.9  5K   0   0   0 rsync/1
 22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16   0 255   0 prstat/1
   604 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0   1   0   0   0 zpool-data/2
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  14   1 133   0 sshd/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
   604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  62   0   0   0 zpool-data/3
   604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  28   0   0   0 zpool-data/39
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   1   5   0 nscd/7
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/15
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/14
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/12
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/11
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/10
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1326 root      47  11 0.0 0.0 0.0 0.0  40 1.5  6K  34 45K   0 rsync/1
  1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.8  6K   0   0   0 rsync/1
 22644 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0  16   1 255   0 prstat/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   3  31   0 sshd/1
   178 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   2   0 devfsadm/3
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/15
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/14
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/12
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/11
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/10
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/9
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/8
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/7
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
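Since the stall is shorter than the 1-second floor of mpstat/prstat, one way to see where the CPU goes during the burst is a DTrace profile aggregation; this is only a sketch (the probe rate and the 500 ms tick are arbitrary choices, not something suggested in the thread):

    dtrace -qn 'profile-997 { @[execname, cpu] = count(); }
                tick-500ms { printa(@); trunc(@); }'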
Sandon Van Ness
2010-May-31 22:33 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 01:13 PM, Bob Friesenhahn wrote:
> On Sun, 30 May 2010, Sandon Van Ness wrote:
>>
>> The problem is that when it does the write burst, it takes CPU away from
>> rsync, so what is causing the dip during writes may not be the I/O
>> activity itself but the CPU load generated by the writes.
>
> You didn't say which Solaris you are using, but I think that there are
> recent updates to Solaris 10 and OpenSolaris which are supposed to
> solve this problem (zfs blocking access to CPU by applications).
>
> From Solaris 10 x86 (kernel 142901-09):
>
> 6586537 async zio taskqs can block out userland commands
>
> Bob

By the way, I tried the workaround in that bug:

http://bugs.opensolaris.org/view_bug.do?bug_id=6586537

However, I get:

zio_taskq_threads is not defined in 'zfs' module

when I make the change to /etc/system, and I also get:

root at opensolaris: 03:20 PM :/etc# echo zio_taskq_threads/D | mdb
mdb: failed to dereference symbol: unknown symbol name

Also, I can see it is running a ton of threads in the prstat output from earlier. Does anyone know a way I can reduce it to one thread? I think that workaround would work for me.
Mike Gerdts
2010-May-31 22:34 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Sorry, turned on html mode to avoid gmail's line wrapping.

On Mon, May 31, 2010 at 4:58 PM, Sandon Van Ness <sandon at van-ness.com> wrote:
> On 05/31/2010 02:52 PM, Mike Gerdts wrote:
>> To see if the CPU is pegged, take a look at the output of:
>>
>> mpstat 1
>> prstat -mLc 1
>>
>> If mpstat shows that the idle time reaches 0, or the process' latency
>> column is more than a few tenths of a percent, you are probably short
>> on CPU.
>>
>> It could also be that interrupts are stealing cycles from rsync.
>> Placing it in a processor set with interrupts disabled in that
>> processor set may help.
>
> Unfortunately neither of these utilities makes it possible to get values
> for less than 1 second, which is how long the hang is (it happens for
> about half a second).
>
> Here is with mpstat:

<snip - bad line wrapped lines removed>

> Here is what I get with prstat:

<snip - removed first interval & fixed formatting of next>

> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>    604 root     0.0  33 0.0 0.0 0.0 0.0  42  25  18  13   0   0 zpool-data/13
>    604 root     0.0  30 0.0 0.0 0.0 0.0  41  29  12  12   0   0 zpool-data/15
>   1326 root      12 2.9 0.0 0.0 0.0 0.0  85 0.4  1K  12 11K   0 rsync/1
>    604 root     0.0  15 0.0 0.0 0.0 0.0  41  44 111   9   0   0 zpool-data/27
>    604 root     0.0  14 0.0 0.0 0.0 0.0  43  42  72   3   0   0 zpool-data/33
>    604 root     0.0 5.9 0.0 0.0 0.0 0.0  41  53 109   6   0   0 zpool-data/19
>    604 root     0.0 5.4 0.0 0.0 0.0 0.0  42  53 106   8   0   0 zpool-data/25
>    604 root     0.0 5.3 0.0 0.0 0.0 0.0  43  51 107   7   0   0 zpool-data/21
>    604 root     0.0 4.5 0.0 0.0 0.0 0.0  41  54 110   4   0   0 zpool-data/31
>    604 root     0.0 3.9 0.0 0.0 0.0 0.0  41  55 109   3   0   0 zpool-data/23
>    604 root     0.0 3.7 0.0 0.0 0.0 0.0  44  52 111   2   0   0 zpool-data/29
>   1322 root     0.0 0.4 0.0 0.0 0.0 0.0  98 2.0  1K   0   1   0 rsync/1
>  22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16  13 255   0 prstat/1
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   5   3  69   0 sshd/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   2 105   0 nscd/17

In the interval above, it looks as though rsync is spending only a very small amount of time waiting for a CPU (LAT, or latency), but various zfs threads spent quite a bit of time (relative to the interval) waiting for a CPU. Compared to the next interval, rsync spent very little time on cpu (usr + sys): 14.9% vs. 55%, 58%, etc. Perhaps it is not being fed data quickly enough because of CPU contention that prevents timely transferring of data from the NIC to rsync.

> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>   1326 root      44  11 0.0 0.0 0.0 0.0  44 1.3  5K  24 42K   0 rsync/1
>   1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.9  5K   0   0   0 rsync/1
>  22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16   0 255   0 prstat/1
>    604 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0   1   0   0   0 zpool-data/2
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  14   1 133   0 sshd/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
>    604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  62   0   0   0 zpool-data/3
>    604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  28   0   0   0 zpool-data/39
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   1   5   0 nscd/7
> <snip - idle zpool-rpool threads>
>
> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>   1326 root      47  11 0.0 0.0 0.0 0.0  40 1.5  6K  34 45K   0 rsync/1
>   1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.8  6K   0   0   0 rsync/1
>  22644 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0  16   1 255   0 prstat/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   3  31   0 sshd/1
>    178 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   2   0 devfsadm/3
> <snip - idle zpool-rpool threads>
>
> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Bob Friesenhahn
2010-May-31 23:45 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, 31 May 2010, Sandon Van Ness wrote:
>
> I think I have come to the conclusion that the problem here is CPU,
> because it only does this with parity raid. If it were I/O-based I would
> expect it to be the same, if anything heavier on I/O with non-parity
> raid, since that configuration is no longer CPU-bottlenecked (a dd write
> test gives me near 700 megabytes/sec vs 450 with raidz2).

The "parity RAID" certainly does impose more computational overhead, but not because of the parity calculation. You should put that out of your mind right away. With raidz, each 128K block is chopped into smaller chunks which are written across the disks in the vdev. This is less efficient (in many ways, the least of which is "parity") than writing 128K blocks to each disk in turn. You are creating a blast of smaller I/Os to the various disks, which may seem like more CPU but could be related to PCI-E access, interrupts, or a controller bottleneck.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
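To make that concrete, assuming the pool is a single 20-disk raidz2 vdev (which would match the ~18.1T capacity shown earlier): each 128 KB record is split across 18 data disks, so

    128 KB / 18 data disks ≈ 7.1 KB per disk, plus 2 parity chunks per record

which means roughly 20 small device writes per record, versus a single 128 KB write per disk in turn on the non-redundant 20-disk pool.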
Sandon Van Ness
2010-Jun-01 04:27 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 04:45 PM, Bob Friesenhahn wrote:
> On Mon, 31 May 2010, Sandon Van Ness wrote:
>> <snip>
>
> The "parity RAID" certainly does impose more computational overhead,
> but not because of the parity calculation. You should put that out of
> your mind right away. With raidz, each 128K block is chopped into
> smaller chunks which are written across the disks in the vdev. This
> is less efficient (in many ways, the least of which is "parity") than
> writing 128K blocks to each disk in turn. You are creating a blast of
> smaller I/Os to the various disks, which may seem like more CPU but
> could be related to PCI-E access, interrupts, or a controller bottleneck.
>
> Bob

With sequential writes I don't see how the parity writing would behave any differently: the 20-disk zpool I just created does the same writes every 5 seconds, but the difference is that it doesn't max out CPU usage when doing the writes, and I don't see the transfer stall during the writes like I did on raidz2.
Bob Friesenhahn
2010-Jun-01 14:57 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, 31 May 2010, Sandon Van Ness wrote:
> With sequential writes I don't see how the parity writing would behave
> any differently: the 20-disk zpool I just created does the same writes
> every 5 seconds, but the difference is that it doesn't max out CPU usage
> when doing the writes, and I don't see the transfer stall during the
> writes like I did on raidz2.

I am not understanding the above paragraph, but hopefully you agree that raidz2 issues many more writes (based on vdev stripe width) to the underlying disks than a simple non-redundant load-shared pool does. Depending on your system, this might not be an issue, but it is possible that there is an I/O threshold beyond which something (probably hardware) causes a performance issue.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sandon Van Ness
2010-Jun-04 19:20 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 06/01/2010 07:57 AM, Bob Friesenhahn wrote:
> On Mon, 31 May 2010, Sandon Van Ness wrote:
>> <snip>
>
> I am not understanding the above paragraph, but hopefully you agree
> that raidz2 issues many more writes (based on vdev stripe width) to
> the underlying disks than a simple non-redundant load-shared pool
> does. Depending on your system, this might not be an issue, but it is
> possible that there is an I/O threshold beyond which something
> (probably hardware) causes a performance issue.
>
> Bob

Interestingly enough, when I went to copy the data back I got even worse download speeds than I did write speeds! It looks like I need some sort of read-ahead, since unlike the writes it doesn't appear to be CPU-bound: using mbuffer/tar gives me full gigabit speeds. You can see it in my graph here:

http://uverse.houkouonchi.jp/stats/netusage/1.1.1.3_2.html

The weekly graph is when I was sending to the ZFS server, and the daily graph shows it coming back. I stopped it and shut down the computer for a while, which is the low-speed flat line, and then started it up again, this time using mbuffer, and speeds are great. I don't see why I am having trouble getting full speed when doing reads, unless it needs to read ahead more than it does. I decided to go ahead and use tar + mbuffer for the first pass and then run rsync afterwards for the final sync, just to make sure nothing was missed.
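For reference, a sketch of the kind of tar + mbuffer pipeline being described; the host name, port, and buffer size are illustrative, and mbuffer's -O/-I network mode is used so no ssh CPU overhead is added on the sending box:

    # on the ZFS box (sender)
    tar cf - /data/somedir | mbuffer -m 1G -O linuxbox:9090

    # on the Linux box (receiver)
    mbuffer -m 1G -I 9090 | tar xf -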
Bob Friesenhahn
2010-Jun-05 01:15 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>
> Interestingly enough, when I went to copy the data back I got even worse
> download speeds than I did write speeds! It looks like I need some sort
> of read-ahead, since unlike the writes it doesn't appear to be CPU-bound:
> using mbuffer/tar gives me full gigabit speeds. You can see it in my
> graph here:
>
> http://uverse.houkouonchi.jp/stats/netusage/1.1.1.3_2.html

I am still not sure what you are doing; however, it should not be a surprise that gigabit ethernet is limited to one gigabit of traffic (1000 Mb/s) in either direction. Theoretically you should be able to get a gigabit of traffic in both directions at once, but this depends on the quality of your ethernet switch, ethernet adaptor card, device driver, and the capabilities of wherever the data is read and written.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sandon Van Ness
2010-Jun-05 03:53 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 06/04/2010 06:15 PM, Bob Friesenhahn wrote:
> On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>> <snip>
>> http://uverse.houkouonchi.jp/stats/netusage/1.1.1.3_2.html
>
> I am still not sure what you are doing; however, it should not be a
> surprise that gigabit ethernet is limited to one gigabit of traffic
> (1000 Mb/s) in either direction. Theoretically you should be able to
> get a gigabit of traffic in both directions at once, but this depends
> on the quality of your ethernet switch, ethernet adaptor card, device
> driver, and the capabilities of wherever the data is read and written.
>
> Bob

The problem is that just using rsync I am not getting gigabit. For me, gigabit maxes out at around 930-940 megabits. When I use rsync alone I was only getting around 720 megabits incoming, and only when it is reading from the block device. When reading from memory (i.e. cat a few big files on the server first so they are cached) it gets ~935 megabits. The machine is easily able to sustain that read speed (and write speed), but the problem is getting it to actually do it.

The only way I was able to get full gig (935 megabits) was using tar and mbuffer, due to it acting as a read-ahead buffer. Is there any way to turn the prefetch up? There really is no reason I should only be getting 720 megabits when copying files off with rsync (or NFS) like I am seeing.
Bob Friesenhahn
2010-Jun-05 20:08 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>
> The problem is that just using rsync I am not getting gigabit. For me,
> gigabit maxes out at around 930-940 megabits. When I use rsync alone I
> was only getting around 720 megabits incoming, and only when it is
> reading from the block device. When reading from memory (i.e. cat a few
> big files on the server first so they are cached) it gets ~935 megabits.
> The machine is easily able to sustain that read speed (and write speed),
> but the problem is getting it to actually do it.

TCP imposes some overhead. Rsync chats back and forth, so additional latency is added. Depending on settings, rsync may read an existing block on the receiving end and only send an update block if the data is different.

> The only way I was able to get full gig (935 megabits) was using tar and
> mbuffer, due to it acting as a read-ahead buffer. Is there any way to
> turn the prefetch up? There really is no reason I should only be getting
> 720 megabits when copying files off with rsync (or NFS) like I am seeing.

While there have been some unfortunate bugs in the prefetch algorithm, a problem when sending many smaller files is that it takes a bit of time for the prefetch to ramp up for each individual file. Zfs needs to learn that prefetch is valuable for the file. For obvious reasons, zfs does not assume maximum prefetch immediately after the file has been opened. For a while now I have argued that zfs should be able to learn filesystem and process behavior based on recent activity and should dynamically tune prefetch based on that knowledge.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
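On the delta-transfer point: when the files do not already exist on the receiving side, rsync's delta algorithm buys nothing and can be skipped; a sketch (standard rsync options, paths illustrative):

    rsync -aW --progress /data/somedir/ linuxbox:/dest/somedir/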
Sandon Van Ness
2010-Jun-05 20:23 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 06/05/2010 01:08 PM, Bob Friesenhahn wrote:
> On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>>
>> The problem is that just using rsync I am not getting gigabit. For me,
>> gigabit maxes out at around 930-940 megabits.
>> <snip>
>
> TCP imposes some overhead. Rsync chats back and forth, so additional
> latency is added. Depending on settings, rsync may read an existing
> block on the receiving end and only send an update block if the data
> is different.

It is not TCP overhead, and it is not rsync. I see identical behavior when reading over NFS as well. Basically, if the file has already been read on the server itself (i.e. md5sum it so it is in the disk cache) I get 115 megabytes/sec read speed over rsync or NFS. As soon as it has to physically read off the disks, speeds go down to 90 megabytes/sec (changing logbias to throughput upped this to 100). Using tar/mbuffer I am able to get a constant 115 megabytes/sec.

>> The only way I was able to get full gig (935 megabits) was using tar and
>> mbuffer, due to it acting as a read-ahead buffer. Is there any way to
>> turn the prefetch up? There really is no reason I should only be getting
>> 720 megabits when copying files off with rsync (or NFS) like I am seeing.
>
> While there have been some unfortunate bugs in the prefetch algorithm,
> a problem when sending many smaller files is that it takes a bit of
> time for the prefetch to ramp up for each individual file. Zfs needs
> to learn that prefetch is valuable for the file. For obvious reasons,
> zfs does not assume maximum prefetch immediately after the file has
> been opened. For a while now I have argued that zfs should be able to
> learn filesystem and process behavior based on recent activity and
> should dynamically tune prefetch based on that knowledge.
>
> Bob

Actually, I am running into this problem when copying files several hundred megabytes in size that take more than a second to copy. Can the prefetch be tuned so it reads more data in advance?
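On that final question: in the OpenSolaris code of that era the file-level prefetcher (dmu_zfetch) has a handful of tunables; the names below come from that source and may differ or be absent on other builds, so treat this strictly as a sketch of where to look rather than a recommended setting:

    # current per-stream prefetch depth, in blocks (256 was the default then)
    echo zfetch_block_cap/D | mdb -k

    # example: raise it to 512 blocks at runtime
    echo zfetch_block_cap/W0t512 | mdb -kw

    # and "set zfs:zfs_prefetch_disable = 1" in /etc/system turns file-level prefetch off entirely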