Hi all,

First, kudos to all the ZFS folks for a killer technology. We use several Sun 7000 series
boxes at work and love the features. I recently decided to build an OpenSolaris server for
home and put the box together over the weekend. It uses an LSI 1068E-based HBA (Supermicro,
FWIW) and eight 2TB WD drives in a single raidz2 pool. It is a clean install of snv_128a;
the only changes from vanilla were installing the CIFS server packages and creating and
sharing a CIFS share.

I started copying over all the data from my existing workstation. When copying files
(mostly multi-gigabyte DV video files), network throughput drops to zero for ~1/2 second
every 8-15 seconds. This throughput drop corresponds to drive activity on the OpenSolaris
box; the ZFS pool drives show no activity except every 8-15 seconds. My best guess is that
the OpenSolaris box is caching traffic and batching it to disk every so often. I just
didn't expect disk writes to interrupt network traffic. Is this correct?

One other item to note: the pool is currently degraded, as one of the drives was apparently
damaged during shipping and died almost immediately after I created the pool. I completely
removed this drive to RMA it.

I'd be happy to provide any info needed. Thanks in advance.

Richard Bruce
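For anyone wanting to watch the bursts from the server side, something like the following
should make the pattern visible (the pool name "tank" is just a placeholder; substitute
your own):

  # Per-second pool throughput -- expect near-zero writes between bursts
  zpool iostat tank 1

  # Per-disk view of the same bursts (only non-idle devices are shown)
  iostat -xnz 1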
On Mon, 7 Dec 2009, Richard Bruce wrote:

> I started copying over all the data from my existing workstation.
> When copying files (mostly multi-gigabyte DV video files), network
> throughput drops to zero for ~1/2 second every 8-15 seconds. This
> throughput drop corresponds to drive activity on the OpenSolaris
> box. The ZFS pool drives show no activity except every 8-15
> seconds. As best as I can guess, the OpenSolaris box is caching
> traffic and batching it to disk every so often. I just didn't
> expect disk writes to interrupt network traffic. Is this correct?

This is expected behavior. From what has been posted here, these are the current
buffering rules:

  up to 7/8ths of available memory
  up to 5 seconds worth of 100% write I/O time
  up to 30 seconds without a write

and if you don't like it, you can use the zfs:zfs_arc_max tunable in /etc/system to set a
maximum amount of memory to be used prior to a write. This may be useful on systems with
a large amount of memory that want to limit the maximum delay due to committing the zfs
transaction group. There will still be interruptions, but the interruptions can be made
briefer (and more frequent).

Bob
--
Bob Friesenhahn, bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
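For reference, the two time-based rules above correspond (if memory serves -- the exact
variable names should be verified against your build before relying on them) to the txg
timeout and synctime tunables, which can also be set in /etc/system:

  * Assumed tunable names from the OpenSolaris write throttle of this era
  * Force a txg commit at least every 30 seconds
  set zfs:zfs_txg_timeout = 30
  * Aim for no more than ~5 seconds of full-rate write I/O per commit
  set zfs:zfs_txg_synctime = 5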
On Mon, 7 Dec 2009, Bob Friesenhahn wrote:

> and if you don't like it, you can use the zfs:zfs_arc_max tunable in
> /etc/system to set a maximum amount of memory to be used prior to a write.

Oops. Bad cut-n-paste. That should have been

  zfs:zfs_write_limit_override

So I am currently using

  * Set ZFS maximum TXG group size to 3932160000
  set zfs:zfs_write_limit_override = 0xea600000

Bob
--
Bob Friesenhahn, bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
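As an aside, the same override can apparently be applied to a running kernel with mdb -kw
instead of editing /etc/system and rebooting; the syntax below follows the Evil Tuning
Guide's style, so treat it as a sketch and verify it on your own build:

  # Set zfs_write_limit_override (a 64-bit value) on the live system
  echo zfs_write_limit_override/Z 0xea600000 | mdb -kw

  # Read back the current value
  echo zfs_write_limit_override/J | mdb -k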
>> and if you don't like it, you can use the zfs:zfs_arc_max tunable in
>> /etc/system to set a maximum amount of memory to be used prior to a write.
>
> Oops. Bad cut-n-paste. That should have been
>
>   zfs:zfs_write_limit_override
>
> So I am currently using
>
>   * Set ZFS maximum TXG group size to 3932160000
>   set zfs:zfs_write_limit_override = 0xea600000

I have a DAS array with NVRAM, so I enabled zfs_nocacheflush = 1 and it made a world of
difference in performance. Does the LSI HBA have any NVRAM to make this tuning acceptable?
Is this setting acceptable, as I understood it from the Evil Tuning Guide?
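For reference, that tuning is just the single /etc/system line below. The usual caveat
from the Evil Tuning Guide is that it is only safe when every device in the pool sits
behind non-volatile (battery- or NVRAM-backed) write cache; a plain cacheless HBA with
bare SATA disks normally does not qualify:

  * Only safe if ALL pool devices have non-volatile write caches --
  * otherwise a power loss can silently lose or corrupt recent writes
  set zfs:zfs_nocacheflush = 1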
Bob,

Thanks for your help. I thought I might have seen something about this in the past but
couldn't remember for sure. Thanks for pointing me in the right direction.

From the URL below, it states that each TXG will be limited to 1/8th of physical memory
(this differs from the 7/8ths of available memory you referenced).

http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

In my current config that would yield a TXG size of 512MB (4GB of system RAM / 8), which
makes sense for ~1 second write times. At the moment this is not causing any real issues;
I just hadn't expected the system to pause network traffic while committing data to disk.

Is there a tunable to bump up the 1-tick delay that is enforced on writing threads when
the first threshold is triggered (7/8ths of the TXG commit threshold, as per the above
URL), or to change the threshold at which the tick delays start being enforced? It seems
to me this would help smooth out I/O better than just limiting the TXG size.

Richard
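For anyone who wants to sanity-check the same arithmetic on their own box, a rough
back-of-the-envelope (the ~100 MB/s figure is just an assumed gigabit wire speed):

  # Physical memory as the kernel sees it
  prtconf | grep Memory              # e.g. "Memory size: 4096 Megabytes"

  # 1/8th of physmem is the default per-TXG write limit described above
  echo $((4096 / 8))                 # => 512 (MB)

  # At ~100 MB/s of incoming CIFS traffic a 512 MB TXG fills in roughly
  # 5 seconds, the same ballpark as the observed 8-15 second gaps
  # between bursts of disk activity.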
On 7 dec 2009, at 18.40, Bob Friesenhahn wrote:

> On Mon, 7 Dec 2009, Richard Bruce wrote:
>
>> I started copying over all the data from my existing workstation. When copying files
>> (mostly multi-gigabyte DV video files), network throughput drops to zero for ~1/2
>> second every 8-15 seconds. [...]
>
> This is expected behavior. From what has been posted here, these are the current
> buffering rules:

Is it really?

Shouldn't it start on the next txg while the previous txg commits, and just continue
writing?

/ragge
On Wed, 9 Dec 2009, Ragnar Sundblad wrote:

>> This is expected behavior. From what has been posted here, these
>> are the current buffering rules:
>
> Is it really?
>
> Shouldn't it start on the next txg while the previous txg commits,
> and just continue writing?

The pause is clearly not during the entire TXG commit; the TXG commit could take up to
five seconds to complete. Perhaps the pause occurs only during the start of the commit,
or perhaps it is at the end, or perhaps it is because the next TXG has already become
100% full while waiting for the current TXG to commit, and zfs is not willing to endanger
more than one TXG worth of data, so it pauses?

To my recollection, none of the zfs developers have been interested in discussing the
cause of the pause, although they are clearly interested in maximizing performance.

Bob
--
Bob Friesenhahn, bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/