Sandon Van Ness
2010-May-30 22:04 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Basically for a few seconds at a time I can get very nice speeds through rsync (saturating a 1 gig link), around 112-113 megabytes/sec, which is about as good as I can expect after overhead. The problem is that every 5 seconds, when data is actually written to the disks (physically watching the disk LEDs), I see the activity on the machine sending the data stall while the ZFS machine shows disk activity (writes).

This problem brings the average write speed down from around 112-113 megabytes/sec to around 100 megabytes/sec (sometimes lower), so it lowers speeds by 10% or a bit more. I only really care because 10% can make a difference when you are copying terabytes of data to the machine. Anyway, here is what I am seeing on the Linux machine that is sending to the ZFS machine:

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3      17.00    0.00  496.00  0.00  112.00   0.00    462.45      0.42   0.84   0.16   7.90

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           6.34   0.00     3.80     0.00    0.00  89.86

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       1.00    0.00  584.00  0.00  111.75   0.00    391.88      0.43   0.74   0.12   7.10

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           6.84   0.00     5.44     0.00    0.00  87.72

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       4.00    0.00  557.00  0.00  112.00   0.00    411.81      0.43   0.77   0.13   7.10

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.64   0.00     2.51     0.00    0.00  92.86

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       1.98    0.00  104.95  0.00   23.76   0.00    463.70      0.09   0.84   0.17   1.78

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7.11   0.00     5.08     0.00    0.00  87.82

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       9.00    0.00  538.00  0.00  112.69   0.00    428.97      0.39   0.72   0.14   7.50

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7.46   0.00     5.56     0.00    0.00  86.98

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       8.00    0.00  524.00  0.00  112.00   0.00    437.74      0.38   0.72   0.13   6.90

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7.98   0.00     5.96     0.00    0.00  86.06

Device:  rrqm/s  wrqm/s     r/s   w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc3       1.00    0.00  493.00  0.00  111.29   0.00    462.31      0.39   0.80   0.16   7.90

This is iostat -xm 1 sdc3. Basically you can see it is reading at around full gig speed and then the reads drop off because writes are stalled on the ZFS machine. These are 1-second averages.

On the ZFS end, with 10-second averages, I see approximately 100 MB/sec:

data        25.1G  18.1T      0    834      0   100M
data        26.2G  18.1T      0    833      0   100M
data        27.3G  18.1T      0    833      0   100M

Change this to 1 second and I see:

data        32.7G  18.1T      0      0      0      0
data        32.7G  18.1T      0  2.86K      0   360M
data        33.3G  18.1T      0    264      0  21.8M
data        33.3G  18.1T      0      0      0      0
data        33.3G  18.1T      0      0      0      0
data        33.3G  18.1T      0      0      0      0
data        33.3G  18.1T      0  2.94K      0   369M
data        33.8G  18.1T      0    375      0  35.1M
data        33.8G  18.1T      0      0      0      0
data        33.8G  18.1T      0      0      0      0
data        33.8G  18.1T      0      0      0      0
data        33.8G  18.1T      0  2.90K      0   365M
data        34.4G  18.1T      0    599      0  62.6M
data        34.4G  18.1T      0      0      0      0
data        34.4G  18.1T      0      0      0      0
data        34.4G  18.1T      0      0      0      0
data        34.4G  18.1T      0  2.10K      0   265M
data        34.9G  18.1T      0  1.77K      0   211M

I tried changing the txg sync time from 30 to 1, and that did make things smoother but in general lowered speeds (down to 90 megabytes/sec or so).
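For reference, the pool-side numbers above are zpool iostat samples; a sketch of the two invocations that would produce them, assuming the pool is named data as the output shows:

    zpool iostat data 10     # 10-second averages: a steady ~100 MB/s
    zpool iostat data 1      # 1-second samples: idle gaps followed by ~360 MB/s txg bursts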
Actually writing files to the array I see well in excess of 112 megabytes/sec, so I would think I should be able to get this to go at full gig speed without the small stalls:

root at opensolaris: 11:36 AM :/data# dd bs=1M count=100000 if=/dev/zero of=./100gb.bin
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 233.257 s, 450 MB/s
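One caveat on that dd figure: /dev/zero is perfectly compressible, so if compression were enabled on the dataset the 450 MB/s would be optimistic. A minimal sketch of a variant using incompressible input (paths and sizes are illustrative, not from the thread):

    # stage ~2 GB of random data on the root pool, then time a copy into the data pool
    dd if=/dev/urandom of=/var/tmp/random.bin bs=1M count=2000
    dd if=/var/tmp/random.bin of=/data/random-copy.bin bs=1M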
Richard Elling
2010-May-30 23:22 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On May 30, 2010, at 3:04 PM, Sandon Van Ness wrote:
> Basically for a few seconds at a time I can get very nice speeds through
> rsync (saturating a 1 gig link), around 112-113 megabytes/sec, which is
> about as good as I can expect after overhead. The problem is that every
> 5 seconds, when data is actually written to the disks (physically
> watching the disk LEDs), I see the activity on the machine sending the
> data stall while the ZFS machine shows disk activity (writes).
>
> <snip - quoted iostat and zpool iostat output from the original message>
>
> I tried changing the txg sync time from 30 to 1, and that did make things
> smoother but in general lowered speeds (down to 90 megabytes/sec or so).
> Actually writing files to the array I see well in excess of 112
> megabytes/sec, so I would think I should be able to get this to go at
> full gig speed without the small stalls:

I have better luck tuning the zfs_txg_synctime_ms from 5000 to 1000 or less.

> root at opensolaris: 11:36 AM :/data# dd bs=1M count=100000 if=/dev/zero
> of=./100gb.bin
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 233.257 s, 450 MB/s

If you want to decouple the txg commit completely, then you might consider
using a buffer of some sort. I use mbuffer for pipes, but that may be tricky
to use in an rsync environment.
 -- richard

-- 
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
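For anyone wanting to try Richard's suggestion, a minimal sketch of the two usual ways to change that tunable; the 1000 ms value is just his example, and the symbol name and size should be verified against the running build before poking it with mdb:

    # runtime change (assumes zfs_txg_synctime_ms is a 32-bit int, as in the
    # OpenSolaris source of that era); 0t makes the value decimal
    echo zfs_txg_synctime_ms/W0t1000 | mdb -kw

    # persistent change in /etc/system (takes effect at the next boot)
    set zfs:zfs_txg_synctime_ms = 1000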
Sandon Van Ness
2010-May-31 04:06 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/30/2010 04:22 PM, Richard Elling wrote:
> If you want to decouple the txg commit completely, then you might consider
> using a buffer of some sort. I use mbuffer for pipes, but that may be tricky
> to use in an rsync environment.
> -- richard

I initially thought this was I/O, but now I think it might be a CPU bottleneck causing the problem. When I write zeros using dd I can get quite good speed (~500 megabytes/sec), but it totally maxes out both of my CPU cores when I do so (2-5% idle in top). The problem is that when it does the write burst, it takes CPU away from rsync, so what is causing the dip during writes may not be the I/O activity itself but the CPU load generated by the writes.

I actually verified this by making a non-raidz2 pool, just adding all the disks into a pool with nothing else specified:

zpool create data c4t5000C500028BD5FCd0p0 c4t5000C50009A4D727d0p0 \
    c4t5000C50009A46AF5d0p0 c4t5000C50009A515B0d0p0 c4t5000C500028A81BEd0p0 \
    c4t5000C500028B44A1d0p0 c4t5000C500028B415Bd0p0 c4t5000C500028B23D2d0p0 \
    c4t5000C5000CC3338Dd0p0 c4t5000C500027F59C8d0p0 c4t5000C50009DBF8D4d0p0 \
    c4t5000C500027F3C1Fd0p0 c4t5000C5000DAF02F3d0p0 c4t5000C5000DA7ED4Ed0p0 \
    c4t5000C5000DAEF990d0p0 c4t5000C5000DAEEF8Ed0p0 c4t5000C5000DAEB881d0p0 \
    c4t5000C5000A121581d0p0 c4t5000C5000DAC848Fd0p0 c4t5000C50002770EE6d0p0

Once I did this I got a nice, stable 115 megabytes/sec over the network, so 11.5% better. So the problem appears to be that when it goes to write data it uses 100% of its CPU (even if only for some number of ms at a time), which stalls the network while it does this. It happens when it does parity calculations but not with a non-parity zpool. I don't think I can throttle the parity calculations at all, so I don't think there will be a fix for this, unfortunately =(. I can live with losing 10% off my rsync speed, though.
Bob Friesenhahn
2010-May-31 20:13 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Sun, 30 May 2010, Sandon Van Ness wrote:
>
> The problem is that when it does the write burst, it takes CPU away from
> rsync, so what is causing the dip during writes may not be the I/O
> activity itself but the CPU load generated by the writes.

You didn't say which Solaris you are using, but I think that there are recent updates to Solaris 10 and OpenSolaris which are supposed to solve this problem (zfs blocking access to CPU by applications).

From Solaris 10 x86 (kernel 142901-09):

6586537 async zio taskqs can block out userland commands

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sandon Van Ness
2010-May-31 20:32 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 01:13 PM, Bob Friesenhahn wrote:
> On Sun, 30 May 2010, Sandon Van Ness wrote:
>>
>> The problem is that when it does the write burst, it takes CPU away from
>> rsync, so what is causing the dip during writes may not be the I/O
>> activity itself but the CPU load generated by the writes.
>
> You didn't say which Solaris you are using, but I think that there are
> recent updates to Solaris 10 and OpenSolaris which are supposed to
> solve this problem (zfs blocking access to CPU by applications).
>
> From Solaris 10 x86 (kernel 142901-09):
>
> 6586537 async zio taskqs can block out userland commands
>
> Bob

I am using OpenSolaris snv_134:

root at opensolaris: 01:32 PM :~# uname -a
SunOS opensolaris 5.11 snv_134 i86pc i386 i86pc

Is there a setting to change the CPU scheduling for the ZFS process, or is it supposed to just be fixed out of the box? I figure if the parity calculation process could be 'reniced' or something, this problem could likely be fixed, since it ended up being CPU and not disk I/O like I originally thought.
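A quick way to check which scheduling class the competing threads actually run in (a diagnostic sketch; the process names and the rsync pid come from the top output later in the thread, and the priocntl line is only an illustration of moving a process into a fixed-priority class, not something recommended in the thread):

    # show the scheduling class and priority of rsync and the pool's I/O threads
    ps -e -o pid,class,pri,args | egrep 'rsync|zpool-data'

    # illustration only: put the rsync pid (1326 in the later top output) into FX at priority 60
    priocntl -s -c FX -m 60 -p 60 -i pid 1326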
Bob Friesenhahn
2010-May-31 20:51 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, 31 May 2010, Sandon Van Ness wrote:
>> 6586537 async zio taskqs can block out userland commands
>>
>> Bob
>
> I am using OpenSolaris snv_134:
>
> root at opensolaris: 01:32 PM :~# uname -a
> SunOS opensolaris 5.11 snv_134 i86pc i386 i86pc
>
> Is there a setting to change the CPU scheduling for the ZFS process, or
> is it supposed to just be fixed out of the box? I figure if the parity
> calculation process could be 'reniced' or something, this problem could
> likely be fixed, since it ended up being CPU and not disk I/O like I
> originally thought.

There are multiple factors at work. Your OpenSolaris should be new enough to have the fix in which the zfs I/O tasks are run in a scheduling class at lower priority than normal user processes. However, there is also a throttling mechanism for processes which produce data faster than the disks can consume it. This throttling mechanism depends on the amount of RAM available to zfs and the write speed of the I/O channel. More available RAM results in more write buffering, which results in a larger chunk of data written at the next transaction group write interval. The maximum size of a transaction group may be configured in /etc/system similar to:

* Set ZFS maximum TXG group size to 2684354560
set zfs:zfs_write_limit_override = 0xa0000000

If the transaction group is smaller, then zfs will need to write more often. Processes will still be throttled, but the duration of the delay should be smaller because there is less data to write in each burst. I think that (with multiple writers) the zfs pool will be "healthier" and less fragmented if you can offer zfs more RAM and accept some stalls during writing. There are always tradeoffs.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
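For reference, a sketch of inspecting and changing that write limit at runtime instead of rebooting with a new /etc/system; zfs_write_limit_override is a 64-bit variable in the OpenSolaris source of that era, hence the 64-bit mdb formats, but verify the symbol on your build first:

    # read the current override (0 means no override is in effect)
    echo zfs_write_limit_override/E | mdb -k

    # set it to 0xa0000000 (2684354560 bytes) without a reboot
    echo zfs_write_limit_override/Z 0xa0000000 | mdb -kw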
Sandon Van Ness
2010-May-31 21:32 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
> There are multiple factors at work. Your OpenSolaris should be new
> enough to have the fix in which the zfs I/O tasks are run in a
> scheduling class at lower priority than normal user processes.
> <snip>
> I think that (with multiple writers) the zfs pool will be "healthier"
> and less fragmented if you can offer zfs more RAM and accept some
> stalls during writing. There are always tradeoffs.
>
> Bob

Well, it seems like when messing with the txg sync times and similar settings it did make the transfer smoother, but it didn't actually help with speeds; it just meant the hangs happened for a shorter time at a smaller interval, and actually lowering the time between writes just seemed to make things slightly worse.

I think I have come to the conclusion that the problem here is CPU, because it only does this with parity raid. If it were I/O-based I would expect it to be the same, if anything heavier on I/O with non-parity raid, since that configuration is no longer CPU-bottlenecked (a dd write test gives me near 700 megabytes/sec vs 450 with raidz2).

So if I am understanding things, the issue I am seeing should already be fixed, but apparently it is not (in my case), as CPU usage from the parity/zfs calculations is taking precedence over the process doing the writing (rsync)?

I think I have nearly 100% come to the conclusion that the issue is CPU-based, because I saw the same dips even when using mbuffer.
Mike Gerdts
2010-May-31 21:52 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness <sandon at van-ness.com> wrote:
> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
>> There are multiple factors at work. Your OpenSolaris should be new
>> enough to have the fix in which the zfs I/O tasks are run in a
>> scheduling class at lower priority than normal user processes.
>> <snip>
>
> Well, it seems like when messing with the txg sync times and similar
> settings it did make the transfer smoother, but it didn't actually help
> with speeds; it just meant the hangs happened for a shorter time at a
> smaller interval, and actually lowering the time between writes just
> seemed to make things slightly worse.
>
> I think I have come to the conclusion that the problem here is CPU,
> because it only does this with parity raid. If it were I/O-based I would
> expect it to be the same, if anything heavier on I/O with non-parity
> raid, since that configuration is no longer CPU-bottlenecked (a dd write
> test gives me near 700 megabytes/sec vs 450 with raidz2).

To see if the CPU is pegged, take a look at the output of:

mpstat 1
prstat -mLc 1

If mpstat shows that the idle time reaches 0, or the process' latency column is more than a few tenths of a percent, you are probably short on CPU.

It could also be that interrupts are stealing cycles from rsync. Placing it in a processor set with interrupts disabled in that processor set may help.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
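A sketch of what that processor-set suggestion could look like on this two-core box; the set and CPU numbers are illustrative, and on a two-core machine this leaves only the other core for everything else, so it is very much a trade-off:

    # create a processor set containing CPU 1 (prints the new set id, assumed to be 1 below)
    psrset -c 1

    # stop the CPUs in set 1 from handling interrupts
    psrset -f 1

    # bind the running rsync (pid 1326 in the later top output) to the set
    psrset -b 1 1326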
Sandon Van Ness
2010-May-31 21:53 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 02:32 PM, Sandon Van Ness wrote:
> Well, it seems like when messing with the txg sync times and similar
> settings it did make the transfer smoother, but it didn't actually help
> with speeds; it just meant the hangs happened for a shorter time at a
> smaller interval, and actually lowering the time between writes just
> seemed to make things slightly worse.
> <snip>
> I think I have nearly 100% come to the conclusion that the issue is
> CPU-based, because I saw the same dips even when using mbuffer.

And here is some top output. The slowdowns occur when zpool-data starts using CPU and rsync gets CPU-starved.

Normal activity shows:

last pid: 22635;  load avg: 2.17, 2.18, 2.16;  up 0+18:04:42          14:53:29
59 processes: 57 sleeping, 1 running, 1 on cpu
CPU states: 54.7% idle, 23.4% user, 21.9% kernel, 0.0% iowait, 0.0% swap
Kernel: 37646 ctxsw, 193 trap, 20914 intr, 45295 syscall
Memory: 4027M phys mem, 190M free mem, 2013M total swap, 2013M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  1326 root        1  59  -20  383M   44M run    496:47 28.87% rsync
  1322 root        1  59  -20  383M  357M sleep   11:21  0.70% rsync
     3 root        1  60  -20    0K    0K sleep    1:24  0.06% fsflush

When starved:

last pid: 22636;  load avg: 2.16, 2.18, 2.16;  up 0+18:05:16          14:54:03
59 processes: 57 sleeping, 2 on cpu
CPU states: 24.9% idle, 10.5% user, 64.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 17855 ctxsw, 18 trap, 12831 intr, 21090 syscall
Memory: 4027M phys mem, 198M free mem, 2013M total swap, 2013M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
   604 root       39  99  -20    0K    0K cpu/0  316:55 53.36% zpool-data
  1326 root        1  59  -20  383M   44M sleep  497:03 13.49% rsync
  1322 root        1  59  -20  383M  357M sleep   11:21  0.33% rsync
 22635 root        1  59    0 3852K 1912K cpu/1    0:00  0.06% top
     3 root        1  60  -20    0K    0K sleep    1:24  0.06% fsflush

The stall actually lasts less than a second, but the Solaris version of top doesn't seem to accept values below 1 (other than 0) with -s the way you can on Linux (-d .5); otherwise I think zpool-data would show near 100% CPU during the stall.
Sandon Van Ness
2010-May-31 21:58 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 02:52 PM, Mike Gerdts wrote:
> On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness <sandon at van-ness.com> wrote:
>> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
>>> <snip - earlier discussion of the write throttle and zfs_write_limit_override>
>>
>> I think I have come to the conclusion that the problem here is CPU,
>> because it only does this with parity raid. If it were I/O-based I would
>> expect it to be the same, if anything heavier on I/O with non-parity
>> raid, since that configuration is no longer CPU-bottlenecked (a dd write
>> test gives me near 700 megabytes/sec vs 450 with raidz2).
>
> To see if the CPU is pegged, take a look at the output of:
>
> mpstat 1
> prstat -mLc 1
>
> If mpstat shows that the idle time reaches 0, or the process' latency
> column is more than a few tenths of a percent, you are probably short
> on CPU.
>
> It could also be that interrupts are stealing cycles from rsync.
> Placing it in a processor set with interrupts disabled in that
> processor set may help.

Unfortunately neither of these utilities makes it possible to get values for less than 1 second, which is how long the hang is (it happens for about half a second).
Here is with mpstat:

CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 20229 20029 23175   25  222  972    0   322    0  31   0  69
  1    0   0    0    43     9 14926   10  221  909    0 44606   47  14   0  40
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 20230 20029 23618   27  210  914    0   492    0  31   0  69
  1    0   0    0    50     6 14614   12  215  875    0 45930   47  14   0  39
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 11551 11335 12715   39  154  702    0   385    0  62   0  38
  1    0   0    0  2457  2379  8545   35  150  739    0 24349   25  53   0  22
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   14 12804 12591 14333  243  204  610    0   794    1  20   0  79
  1    0   0    4   977   863  9561   11  210  559    0 24590   27  10   0  63
CPU minf mjf xcal  intr  ithr   csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 20239 20035 22932   46  224  996    0   537    0  31   0  69
  1    0   0    0    66     7 15498   12  227  930    0 44808   46  14   0  39

Here is what I get with prstat:

Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1326 root      47  11 0.0 0.0 0.0 0.0  40 1.6  6K  29 45K   0 rsync/1
  1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.4  6K   0   0   0 rsync/1
 22644 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0  16   9 255   0 prstat/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   3  31   0 sshd/1
   446 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   1   0 fmd/24
   196 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   1   0 nscd/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/15
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/14
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/12
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/11
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/10
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/9
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/8
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   604 root     0.0  33 0.0 0.0 0.0 0.0  42  25  18  13   0   0 zpool-data/13
   604 root     0.0  30 0.0 0.0 0.0 0.0  41  29  12  12   0   0 zpool-data/15
  1326 root      12 2.9 0.0 0.0 0.0 0.0  85 0.4  1K  12 11K   0 rsync/1
   604 root     0.0  15 0.0 0.0 0.0 0.0  41  44 111   9   0   0 zpool-data/27
   604 root     0.0  14 0.0 0.0 0.0 0.0  43  42  72   3   0   0 zpool-data/33
   604 root     0.0 5.9 0.0 0.0 0.0 0.0  41  53 109   6   0   0 zpool-data/19
   604 root     0.0 5.4 0.0 0.0 0.0 0.0  42  53 106   8   0   0 zpool-data/25
   604 root     0.0 5.3 0.0 0.0 0.0 0.0  43  51 107   7   0   0 zpool-data/21
   604 root     0.0 4.5 0.0 0.0 0.0 0.0  41  54 110   4   0   0 zpool-data/31
   604 root     0.0 3.9 0.0 0.0 0.0 0.0  41  55 109   3   0   0 zpool-data/23
   604 root     0.0 3.7 0.0 0.0 0.0 0.0  44  52 111   2   0   0 zpool-data/29
  1322 root     0.0 0.4 0.0 0.0 0.0 0.0  98 2.0  1K   0   1   0 rsync/1
 22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16  13 255   0 prstat/1
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   5   3  69   0 sshd/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   2 105   0 nscd/17
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1326 root      44  11 0.0 0.0 0.0 0.0  44 1.3  5K  24 42K   0 rsync/1
  1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.9  5K   0   0   0 rsync/1
 22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16   0 255   0 prstat/1
   604 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0   1   0   0   0 zpool-data/2
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  14   1 133   0 sshd/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
   604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  62   0   0   0 zpool-data/3
   604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  28   0   0   0 zpool-data/39
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   1   5   0 nscd/7
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/15
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/14
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/12
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/11
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/10
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1326 root      47  11 0.0 0.0 0.0 0.0  40 1.5  6K  34 45K   0 rsync/1
  1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.8  6K   0   0   0 rsync/1
 22644 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0  16   1 255   0 prstat/1
   196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
 14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   3  31   0 sshd/1
   178 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   2   0 devfsadm/3
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/15
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/14
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/13
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/12
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/11
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/10
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/9
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/8
     5 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zpool-rpool/7
Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
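Since the stall is shorter than the 1-second floor of mpstat/prstat, one way to see where the CPU goes during the burst is a DTrace profile aggregation; this is only a sketch (the probe rate and the 500 ms tick are arbitrary choices, not something suggested in the thread):

    dtrace -qn 'profile-997 { @[execname, cpu] = count(); }
                tick-500ms { printa(@); trunc(@); }'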
Sandon Van Ness
2010-May-31 22:33 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 01:13 PM, Bob Friesenhahn wrote:
> On Sun, 30 May 2010, Sandon Van Ness wrote:
>>
>> The problem is that when it does the write burst, it takes CPU away from
>> rsync, so what is causing the dip during writes may not be the I/O
>> activity itself but the CPU load generated by the writes.
>
> You didn't say which Solaris you are using, but I think that there are
> recent updates to Solaris 10 and OpenSolaris which are supposed to
> solve this problem (zfs blocking access to CPU by applications).
>
> From Solaris 10 x86 (kernel 142901-09):
>
> 6586537 async zio taskqs can block out userland commands
>
> Bob

By the way, I tried the workaround in that bug:

http://bugs.opensolaris.org/view_bug.do?bug_id=6586537

However, I get:

zio_taskq_threads is not defined in 'zfs' module

when I make the change to /etc/system, and I also get:

root at opensolaris: 03:20 PM :/etc# echo zio_taskq_threads/D | mdb
mdb: failed to dereference symbol: unknown symbol name

Also, I can see it is running a ton of threads in the prstat output from earlier. Does anyone know a way I can reduce it to one thread? I think that workaround would work for me.
Mike Gerdts
2010-May-31 22:34 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Sorry, turned on html mode to avoid gmail's line wrapping.

On Mon, May 31, 2010 at 4:58 PM, Sandon Van Ness <sandon at van-ness.com> wrote:
> On 05/31/2010 02:52 PM, Mike Gerdts wrote:
>> To see if the CPU is pegged, take a look at the output of:
>>
>> mpstat 1
>> prstat -mLc 1
>>
>> If mpstat shows that the idle time reaches 0, or the process' latency
>> column is more than a few tenths of a percent, you are probably short
>> on CPU.
>>
>> It could also be that interrupts are stealing cycles from rsync.
>> Placing it in a processor set with interrupts disabled in that
>> processor set may help.
>
> Unfortunately neither of these utilities makes it possible to get values
> for less than 1 second, which is how long the hang is (it happens for
> about half a second).
>
> Here is with mpstat:

<snip - bad line wrapped lines removed>

> Here is what I get with prstat:

<snip - removed first interval & fixed formatting of next>

> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>    604 root     0.0  33 0.0 0.0 0.0 0.0  42  25  18  13   0   0 zpool-data/13
>    604 root     0.0  30 0.0 0.0 0.0 0.0  41  29  12  12   0   0 zpool-data/15
>   1326 root      12 2.9 0.0 0.0 0.0 0.0  85 0.4  1K  12 11K   0 rsync/1
>    604 root     0.0  15 0.0 0.0 0.0 0.0  41  44 111   9   0   0 zpool-data/27
>    604 root     0.0  14 0.0 0.0 0.0 0.0  43  42  72   3   0   0 zpool-data/33
>    604 root     0.0 5.9 0.0 0.0 0.0 0.0  41  53 109   6   0   0 zpool-data/19
>    604 root     0.0 5.4 0.0 0.0 0.0 0.0  42  53 106   8   0   0 zpool-data/25
>    604 root     0.0 5.3 0.0 0.0 0.0 0.0  43  51 107   7   0   0 zpool-data/21
>    604 root     0.0 4.5 0.0 0.0 0.0 0.0  41  54 110   4   0   0 zpool-data/31
>    604 root     0.0 3.9 0.0 0.0 0.0 0.0  41  55 109   3   0   0 zpool-data/23
>    604 root     0.0 3.7 0.0 0.0 0.0 0.0  44  52 111   2   0   0 zpool-data/29
>   1322 root     0.0 0.4 0.0 0.0 0.0 0.0  98 2.0  1K   0   1   0 rsync/1
>  22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16  13 255   0 prstat/1
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   5   3  69   0 sshd/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   2 105   0 nscd/17

In the interval above, it looks as though rsync is spending only a very small amount of time waiting for a CPU (LAT, or latency), but various zfs threads spent quite a bit of time (relative to the interval) waiting for a CPU. Compared to the next interval, rsync spent very little time on cpu (usr + sys): 14.9% vs. 55%, 58%, etc. Perhaps it is not being fed data quickly enough because of CPU contention that prevents timely transferring of data from the NIC to rsync.

> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>   1326 root      44  11 0.0 0.0 0.0 0.0  44 1.3  5K  24 42K   0 rsync/1
>   1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.9  5K   0   0   0 rsync/1
>  22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16   0 255   0 prstat/1
>    604 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0   1   0   0   0 zpool-data/2
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  14   1 133   0 sshd/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
>    604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  62   0   0   0 zpool-data/3
>    604 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  28   0   0   0 zpool-data/39
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   1   5   0 nscd/7
> <snip - idle zpool-rpool threads>
>
> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>   1326 root      47  11 0.0 0.0 0.0 0.0  40 1.5  6K  34 45K   0 rsync/1
>   1322 root     0.0 1.4 0.0 0.0 0.0 0.0  93 5.8  6K   0   0   0 rsync/1
>  22644 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0  16   1 255   0 prstat/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0 105   0 nscd/17
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   3  31   0 sshd/1
>    178 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   2   0 devfsadm/3
> <snip - idle zpool-rpool threads>
>
> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Bob Friesenhahn
2010-May-31 23:45 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, 31 May 2010, Sandon Van Ness wrote:
>
> I think I have come to the conclusion that the problem here is CPU,
> because it only does this with parity raid. If it were I/O-based I would
> expect it to be the same, if anything heavier on I/O with non-parity
> raid, since that configuration is no longer CPU-bottlenecked (a dd write
> test gives me near 700 megabytes/sec vs 450 with raidz2).

The "parity RAID" certainly does impose more computational overhead, but not because of the parity calculation. You should put that out of your mind right away. With raidz, each 128K block is chopped into smaller chunks which are written across the disks in the vdev. This is less efficient (in many ways, the least of which is "parity") than writing 128K blocks to each disk in turn. You are creating a blast of smaller I/Os to the various disks, which may seem like more CPU but could be related to PCI-E access, interrupts, or a controller bottleneck.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
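To make that concrete, assuming the pool is a single 20-disk raidz2 vdev (which would match the ~18.1T capacity shown earlier): each 128 KB record is split across 18 data disks, so

    128 KB / 18 data disks ≈ 7.1 KB per disk, plus 2 parity chunks per record

which means roughly 20 small device writes per record, versus a single 128 KB write per disk in turn on the non-redundant 20-disk pool.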
Sandon Van Ness
2010-Jun-01 04:27 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 05/31/2010 04:45 PM, Bob Friesenhahn wrote:
> On Mon, 31 May 2010, Sandon Van Ness wrote:
>> <snip>
>
> The "parity RAID" certainly does impose more computational overhead,
> but not because of the parity calculation. You should put that out of
> your mind right away. With raidz, each 128K block is chopped into
> smaller chunks which are written across the disks in the vdev. This
> is less efficient (in many ways, the least of which is "parity") than
> writing 128K blocks to each disk in turn. You are creating a blast of
> smaller I/Os to the various disks, which may seem like more CPU but
> could be related to PCI-E access, interrupts, or a controller bottleneck.
>
> Bob

With sequential writes I don't see how the parity writing would behave any differently: the 20-disk zpool I just created does the same writes every 5 seconds, but the difference is that it doesn't max out CPU usage when doing the writes, and I don't see the transfer stall during the writes like I did on raidz2.
Bob Friesenhahn
2010-Jun-01 14:57 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, 31 May 2010, Sandon Van Ness wrote:
> With sequential writes I don't see how the parity writing would behave
> any differently: the 20-disk zpool I just created does the same writes
> every 5 seconds, but the difference is that it doesn't max out CPU usage
> when doing the writes, and I don't see the transfer stall during the
> writes like I did on raidz2.

I am not understanding the above paragraph, but hopefully you agree that raidz2 issues many more writes (based on vdev stripe width) to the underlying disks than a simple non-redundant load-shared pool does. Depending on your system, this might not be an issue, but it is possible that there is an I/O threshold beyond which something (probably hardware) causes a performance issue.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sandon Van Ness
2010-Jun-04 19:20 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 06/01/2010 07:57 AM, Bob Friesenhahn wrote:
> On Mon, 31 May 2010, Sandon Van Ness wrote:
>> <snip>
>
> I am not understanding the above paragraph, but hopefully you agree
> that raidz2 issues many more writes (based on vdev stripe width) to
> the underlying disks than a simple non-redundant load-shared pool
> does. Depending on your system, this might not be an issue, but it is
> possible that there is an I/O threshold beyond which something
> (probably hardware) causes a performance issue.
>
> Bob

Interestingly enough, when I went to copy the data back I got even worse download speeds than I did write speeds! It looks like I need some sort of read-ahead, since unlike the writes it doesn't appear to be CPU-bound: using mbuffer/tar gives me full gigabit speeds. You can see it in my graph here:

http://uverse.houkouonchi.jp/stats/netusage/1.1.1.3_2.html

The weekly graph is when I was sending to the ZFS server, and the daily graph shows it coming back. I stopped it and shut down the computer for a while, which is the low-speed flat line, and then started it up again, this time using mbuffer, and speeds are great. I don't see why I am having trouble getting full speed when doing reads, unless it needs to read ahead more than it does. I decided to go ahead and use tar + mbuffer for the first pass and then run rsync afterwards for the final sync, just to make sure nothing was missed.
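For reference, a sketch of the kind of tar + mbuffer pipeline being described; the host name, port, and buffer size are illustrative, and mbuffer's -O/-I network mode is used so no ssh CPU overhead is added on the sending box:

    # on the ZFS box (sender)
    tar cf - /data/somedir | mbuffer -m 1G -O linuxbox:9090

    # on the Linux box (receiver)
    mbuffer -m 1G -I 9090 | tar xf -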
Bob Friesenhahn
2010-Jun-05 01:15 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>
> Interestingly enough, when I went to copy the data back I got even worse
> download speeds than I did write speeds! It looks like I need some sort
> of read-ahead, since unlike the writes it doesn't appear to be CPU-bound:
> using mbuffer/tar gives me full gigabit speeds. You can see it in my
> graph here:
>
> http://uverse.houkouonchi.jp/stats/netusage/1.1.1.3_2.html

I am still not sure what you are doing; however, it should not be a surprise that gigabit ethernet is limited to one gigabit of traffic (1000 Mb/s) in either direction. Theoretically you should be able to get a gigabit of traffic in both directions at once, but this depends on the quality of your ethernet switch, ethernet adaptor card, device driver, and the capabilities of wherever the data is read and written.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sandon Van Ness
2010-Jun-05 03:53 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 06/04/2010 06:15 PM, Bob Friesenhahn wrote:
> On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>> <snip>
>> http://uverse.houkouonchi.jp/stats/netusage/1.1.1.3_2.html
>
> I am still not sure what you are doing; however, it should not be a
> surprise that gigabit ethernet is limited to one gigabit of traffic
> (1000 Mb/s) in either direction. Theoretically you should be able to
> get a gigabit of traffic in both directions at once, but this depends
> on the quality of your ethernet switch, ethernet adaptor card, device
> driver, and the capabilities of wherever the data is read and written.
>
> Bob

The problem is that just using rsync I am not getting gigabit. For me, gigabit maxes out at around 930-940 megabits. When I use rsync alone I was only getting around 720 megabits incoming, and only when it is reading from the block device. When reading from memory (i.e. cat a few big files on the server first so they are cached) it gets ~935 megabits. The machine is easily able to sustain that read speed (and write speed), but the problem is getting it to actually do it.

The only way I was able to get full gig (935 megabits) was using tar and mbuffer, due to it acting as a read-ahead buffer. Is there any way to turn the prefetch up? There really is no reason I should only be getting 720 megabits when copying files off with rsync (or NFS) like I am seeing.
Bob Friesenhahn
2010-Jun-05 20:08 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>
> The problem is that just using rsync I am not getting gigabit. For me,
> gigabit maxes out at around 930-940 megabits. When I use rsync alone I
> was only getting around 720 megabits incoming, and only when it is
> reading from the block device. When reading from memory (i.e. cat a few
> big files on the server first so they are cached) it gets ~935 megabits.
> The machine is easily able to sustain that read speed (and write speed),
> but the problem is getting it to actually do it.

TCP imposes some overhead. Rsync chats back and forth, so additional latency is added. Depending on settings, rsync may read an existing block on the receiving end and only send an update block if the data is different.

> The only way I was able to get full gig (935 megabits) was using tar and
> mbuffer, due to it acting as a read-ahead buffer. Is there any way to
> turn the prefetch up? There really is no reason I should only be getting
> 720 megabits when copying files off with rsync (or NFS) like I am seeing.

While there have been some unfortunate bugs in the prefetch algorithm, a problem when sending many smaller files is that it takes a bit of time for the prefetch to ramp up for each individual file. Zfs needs to learn that prefetch is valuable for the file. For obvious reasons, zfs does not assume maximum prefetch immediately after the file has been opened. For a while now I have argued that zfs should be able to learn filesystem and process behavior based on recent activity and should dynamically tune prefetch based on that knowledge.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
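On the delta-transfer point: when the files do not already exist on the receiving side, rsync's delta algorithm buys nothing and can be skipped; a sketch (standard rsync options, paths illustrative):

    rsync -aW --progress /data/somedir/ linuxbox:/dest/somedir/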
Sandon Van Ness
2010-Jun-05 20:23 UTC
[zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On 06/05/2010 01:08 PM, Bob Friesenhahn wrote:
> On Fri, 4 Jun 2010, Sandon Van Ness wrote:
>>
>> The problem is that just using rsync I am not getting gigabit. For me,
>> gigabit maxes out at around 930-940 megabits.
>> <snip>
>
> TCP imposes some overhead. Rsync chats back and forth, so additional
> latency is added. Depending on settings, rsync may read an existing
> block on the receiving end and only send an update block if the data
> is different.

It is not TCP overhead, and it is not rsync. I see identical behavior when reading over NFS as well. Basically, if the file has already been read on the server itself (i.e. md5sum it so it is in the disk cache) I get 115 megabytes/sec read speed over rsync or NFS. As soon as it has to physically read off the disks, speeds go down to 90 megabytes/sec (changing logbias to throughput upped this to 100). Using tar/mbuffer I am able to get a constant 115 megabytes/sec.

>> The only way I was able to get full gig (935 megabits) was using tar and
>> mbuffer, due to it acting as a read-ahead buffer. Is there any way to
>> turn the prefetch up? There really is no reason I should only be getting
>> 720 megabits when copying files off with rsync (or NFS) like I am seeing.
>
> While there have been some unfortunate bugs in the prefetch algorithm,
> a problem when sending many smaller files is that it takes a bit of
> time for the prefetch to ramp up for each individual file. Zfs needs
> to learn that prefetch is valuable for the file. For obvious reasons,
> zfs does not assume maximum prefetch immediately after the file has
> been opened. For a while now I have argued that zfs should be able to
> learn filesystem and process behavior based on recent activity and
> should dynamically tune prefetch based on that knowledge.
>
> Bob

Actually, I am running into this problem when copying files several hundred megabytes in size that take more than a second to copy. Can the prefetch be tuned so it reads more data in advance?
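On that final question: in the OpenSolaris code of that era the file-level prefetcher (dmu_zfetch) has a handful of tunables; the names below come from that source and may differ or be absent on other builds, so treat this strictly as a sketch of where to look rather than a recommended setting:

    # current per-stream prefetch depth, in blocks (256 was the default then)
    echo zfetch_block_cap/D | mdb -k

    # example: raise it to 512 blocks at runtime
    echo zfetch_block_cap/W0t512 | mdb -kw

    # and "set zfs:zfs_prefetch_disable = 1" in /etc/system turns file-level prefetch off entirely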