While doing some performance testing on a pair of X4540s running snv_105, I noticed some odd behavior while using CIFS. I am copying a 6TB database file (yes, a single file) over our GigE network to the X4540, then snapshotting that data to the secondary X4540. Writing that 6TB file can saturate our gigabit network, with about 95-100MB/sec going over the wire (can't ask for any more, really).

However, the disk I/O on the X4540 appears unusual. I would expect the disks to be writing a constant 95-100MB/sec, but the system appears to buffer about 1GB worth of data before committing it to disk. This is in contrast to NFS write behavior: as I write a 1GB file to the NFS server from an NFS client, traffic on the wire correlates closely with the disk writes. For example, 60MB/sec on the wire via NFS produces 60MB/sec on disk. This is a single file in both cases.

I wouldn't have a problem with this "buffer" by itself. It seems to be a rolling 10-second buffer: if I copy several small files at lower speeds, the buffer still "purges" after roughly 10 seconds, not when a certain size is reached. The problem is the amount of data that accumulates in the buffer: committing 1GB to disk at once can slow the system down substantially, and all network traffic pauses or drops to mere kilobytes per second while the buffer is written out.

I would like to see smoother handling of this buffer, or a tuneable to make it write more often or fill more quickly.

This is a 48TB unit with 64GB of RAM, and the arcstat Perl script reports my ARC is 55GB in size, with a near 0% miss rate on reads.

Has anyone seen something similar, or know of any undocumented tuneables to reduce the effects of this?

Here is 'zpool iostat' output, in 1-second intervals, while one of these "write storms" occurs:

# zpool iostat pdxfilu01 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pdxfilu01   2.09T  36.0T      1     61   143K  7.30M
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0     60      0  7.55M
pdxfilu01   2.09T  36.0T      0  1.70K      0   211M
pdxfilu01   2.09T  36.0T      0  2.56K      0   323M
pdxfilu01   2.09T  36.0T      0  2.97K      0   375M
pdxfilu01   2.09T  36.0T      0  3.15K      0   399M
pdxfilu01   2.09T  36.0T      0  2.22K      0   244M
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0

Here is my 'zpool status' output.
# zpool status
  pool: pdxfilu01
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pdxfilu01   ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c9t2d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
            c9t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
            c9t7d0  ONLINE       0     0     0
        spares
          c6t2d0    AVAIL
          c7t3d0    AVAIL
          c8t4d0    AVAIL
          c9t5d0    AVAIL

--
Brent Jones
brent at servuhome.net
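A minimal way to watch the wire rate and the pool writes side by side while the copy runs (a sketch, not from the original post: the link name e1000g0 is a placeholder for whatever interface carries the CIFS traffic, and dladm's -s/-i statistics options are assumed to be available on this build -- check dladm(1M) first):

    # terminal 1: per-second pool write bandwidth (as in the output above)
    zpool iostat pdxfilu01 1

    # terminal 2: per-second link statistics for the interface taking the
    # CIFS traffic; the link name here is a placeholder
    dladm show-link -s -i 1 e1000g0

If the wire stays pegged near 100MB/sec while the pool column shows zeros followed by a multi-hundred-MB/sec burst, that is the TXG batching described later in the thread.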
On Mon, Jan 26, 2009 at 10:40 PM, Brent Jones <brent at servuhome.net> wrote:
> While doing some performance testing on a pair of X4540s running
> snv_105, I noticed some odd behavior while using CIFS.
> [... remainder of the original post and zpool output trimmed; see above ...]
I found some insight into this behavior in a Sun blog post by Roch Bourbonnais: http://blogs.sun.com/roch/date/20080514

An excerpt from the section that matches what I'm seeing:

"The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory."

So when I fill up that transaction group buffer, that is when I see the 4-5 second "I/O burst" of several hundred megabytes per second. He also documents that the buffer flush can, and does, issue delays to the writing threads, which is why I'm seeing those momentary drops in throughput and sluggish system performance while the write buffer is flushed to disk.

I wish there were a better way to handle that, but at the speed I'm writing (and I'll be getting a 10GigE link soon), I don't see any other graceful way of handling that much buffered data.

Loving these X4540s so far, though...

--
Brent Jones
brent at servuhome.net
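A rough back-of-the-envelope check of the figures above (a sketch only; it uses the ~100MB/sec wire rate and ~10-second flush interval reported earlier, plus the 1/8th-of-memory clamp quoted from the blog):

    # 1/8th-of-RAM clamp on a 64GB box, in MB
    echo "64 * 1024 / 8" | bc     # 8192 -> the clamp is about 8GB
    # data accepted between flushes at roughly wire speed, in MB
    echo "100 * 10" | bc          # 1000 -> about 1GB per interval

So the TXG never comes close to the 8GB memory clamp here; it is the ~1GB that accumulates at ~100MB/sec between syncs that gets pushed out in a single 4-5 second burst, which matches the 211-399MB/sec spikes in the zpool iostat output above.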
comment far below...

Brent Jones wrote:
> On Mon, Jan 26, 2009 at 10:40 PM, Brent Jones <brent at servuhome.net> wrote:
>> While doing some performance testing on a pair of X4540s running
>> snv_105, I noticed some odd behavior while using CIFS.
>> [... original post and zpool output trimmed; see above ...]
> I found some insight into this behavior in a Sun blog post by Roch
> Bourbonnais: http://blogs.sun.com/roch/date/20080514
>
> An excerpt from the section that matches what I'm seeing:
>
> "The new code keeps track of the amount of data accepted in a TXG and
> the time it takes to sync. It dynamically adjusts that amount so that
> each TXG sync takes about 5 seconds (txg_time variable). It also
> clamps the limit to no more than 1/8th of physical memory."
>
> So when I fill up that transaction group buffer, that is when I see
> the 4-5 second "I/O burst" of several hundred megabytes per second.
> He also documents that the buffer flush can, and does, issue delays to
> the writing threads, which is why I'm seeing those momentary drops in
> throughput and sluggish system performance while the write buffer is
> flushed to disk.

Yes, this tends to be more efficient. You can tune it by setting
zfs_txg_synctime, which is 5 by default. It is rare that we've seen
this be a win, which is why we don't mention it in the Evil Tuning
Guide.

> I wish there were a better way to handle that, but at the speed I'm
> writing (and I'll be getting a 10GigE link soon), I don't see any
> other graceful way of handling that much buffered data.

I think your workload might change dramatically when you get a
faster pipe. So unless you really feel compelled to change it, I
wouldn't suggest changing it.
-- richard

> Loving these X4540s so far, though...
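For reference, the usual mechanisms for changing a ZFS kernel tuneable such as zfs_txg_synctime on OpenSolaris are /etc/system (persistent, applied at boot) and mdb -kw (live). The variable name comes from Richard's reply; the zfs: module prefix and the exact mdb incantation below are assumptions about this particular build, so verify the symbol exists before relying on either:

    # in /etc/system (persistent, takes effect at next boot):
    #   set zfs:zfs_txg_synctime = 2

    # live change on the running kernel (0t marks decimal; value is in seconds)
    echo "zfs_txg_synctime/W 0t2" | mdb -kw

    # read the current value back
    echo "zfs_txg_synctime/D" | mdb -k

Either way, this is the kind of change Richard suggests leaving alone unless the bursts are actually hurting the workload.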
On Tue, Jan 27, 2009 at 5:47 PM, Richard Elling <richard.elling at gmail.com> wrote:
> comment far below...
>
> [... earlier quotes trimmed; see above ...]
>
> Yes, this tends to be more efficient. You can tune it by setting
> zfs_txg_synctime, which is 5 by default. It is rare that we've seen
> this be a win, which is why we don't mention it in the Evil Tuning
> Guide.
>
> I think your workload might change dramatically when you get a
> faster pipe. So unless you really feel compelled to change it, I
> wouldn't suggest changing it.
> -- richard

Are there any additional tuneables, such as opening a new TXG buffer before the previous one has finished flushing, or otherwise allowing writes to continue without the tick delay?

My workload will be pretty consistent; these units are going to serve a few roles, which I hope to cover on the same hardware:

- large-scale backups
- CIFS shares for Windows app servers
- NFS service for Unix app servers

GigE quickly became the bottleneck, and I imagine 10GigE will add further stress to those write buffers.

--
Brent Jones
brent at servuhome.net