While trying some things earlier to figure out how zpool iostat is supposed to be interpreted, I noticed that ZFS behaves kind of oddly when writing data. Not to say that it's bad, just interesting. I wrote 160MB of zeroed data with dd, with zpool iostat running at a one-second interval.

dd actually finished before the disk activity started, so I suppose ZFS does aggressive write caching. Following the iostats, however, ZFS wrote for two seconds at what I suppose is full speed (roughly 35-40MB/s) and then continued for 26.5 seconds at 3.2MB/s. Adding all the values up, I get to the 160MB.

I found this interesting. Is this intended? What's the rationale behind it? Wouldn't this put huge data writes in jeopardy, if dragged out like this?

Thanks.
-mg
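(For reference, the test amounted to something like the following; the pool name and target file are placeholders for whatever you actually use:)

    # watch pool-level IO once per second while the write runs
    zpool iostat tank 1 &

    # write 160MB of zeros into a file on the pool
    dd if=/dev/zero of=/tank/fs/zerofile bs=1024k count=160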
On 5/8/07, Mario Goebbels <me at tomservo.cc> wrote:
> I wrote 160MB of zeroed data with dd, with zpool iostat running at a
> one-second interval. [...] Is this intended? What's the rationale behind
> it? Wouldn't this put huge data writes in jeopardy, if dragged out like
> this?

ZFS will interpret zero'd sectors as holes, so it won't really write them to
disk; it just adjusts the file size accordingly.

James Dickens
uadmin.blogspot.com
I've noticed similar behavior in my writes. ZFS seems to write in bursts of
around 5 seconds. I assume it's just something to do with caching?

I was watching the drive lights on the T2000s with a 3-disk raidz, and the
disks all blink for a couple of seconds and then are solid for a few seconds.
Is this behavior OK? It seems it would be better to have the disks writing
the whole time instead of in bursts. On my thumper (the burst samples are
marked with a trailing *):

pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
vault1      10.7T  8.32T    108    561  7.23M  24.8M
vault1      10.7T  8.32T    108    152  2.68M  5.90M
vault1      10.7T  8.32T    143    177  6.49M  11.4M
vault1      10.7T  8.32T    147    429  6.59M  27.0M
vault1      10.7T  8.32T    111  3.89K  2.84M   131M  *
vault1      10.7T  8.32T     74    151   460K  6.72M
vault1      10.7T  8.32T    103    180  1.71M  7.21M
vault1      10.7T  8.32T    119    144   832K  5.69M
vault1      10.7T  8.32T    110    185  2.51M  4.75M
vault1      10.7T  8.32T     94  2.17K  1.07M   137M  *
vault1      10.7T  8.32T     36  2.87K   354K  24.9M  *
vault1      10.7T  8.32T     69    140  3.36M  6.00M
vault1      10.7T  8.32T     60    177  4.78M  12.9M
vault1      10.7T  8.32T     90    198  2.82M  5.22M
vault1      10.7T  8.32T     94  1.12K  2.22M  18.1M  *
vault1      10.7T  8.32T     37  3.79K  2.06M   130M  *
vault1      10.7T  8.32T     88    254  2.43M  10.2M
vault1      10.7T  8.32T    137    147  3.64M  7.05M
vault1      10.7T  8.32T    307    415  5.84M  9.38M
vault1      10.7T  8.32T    132  4.13K  2.26M   158M  *
vault1      10.7T  8.32T     57  1.45K  1.89M  13.2M  *
vault1      10.7T  8.32T     78    148   577K  8.47M
vault1      10.7T  8.32T     17    159   749K  6.26M
vault1      10.7T  8.32T     74    248   598K  6.56M
vault1      10.7T  8.32T    178  1.20K  1.62M  23.8M  *
vault1      10.7T  8.32T     46  5.23K  1.01M   168M  *
On Fri, 2007-05-11 at 09:00 -0700, lonny wrote:
> I've noticed similar behavior in my writes. ZFS seems to write in bursts
> of around 5 seconds. I assume it's just something to do with caching?

Yep - the ZFS equivalent of fsflush. Runs more often so the pipes don't
get as clogged. We've had lots of rain here recently, so I'm sort of
sensitive to stories of clogged pipes.

> Is this behavior OK? It seems it would be better to have the disks writing
> the whole time instead of in bursts.

Perhaps - although not in all cases (probably not in most cases). Wouldn't
it be cool to actually do some nice sequential writes to the sweet spot of
the disk bandwidth curve, but not depend on it so much that a single random
I/O here and there throws you for a loop?

Human analogy - it's often more wise to work smarter than harder :-)

Directly to your question - are you seeing any anomalies in file system
read or write performance (bandwidth or latency)?

Bob
On May 11, 2007, at 9:09 AM, Bob Netherton wrote:
> Perhaps - although not in all cases (probably not in most cases). [...]
>
> Directly to your question - are you seeing any anomalies in file system
> read or write performance (bandwidth or latency)?

No performance problems so far; the thumper and ZFS seem to handle
everything we throw at them. On the T2000 internal disks we were seeing a
bottleneck when using a single disk for our apps, but moving to a 3-disk
raidz alleviated that.

The only issue is that when using iostat commands, the bursts make it a
little harder to gauge performance. Is it safe to assume that if those
bursts were to reach the upper performance limit, the writes would spread
out a bit more?

thanks
lonny
Neil.Perrin at Sun.COM
2007-May-11 17:53 UTC
[zfs-discuss] Re: How does ZFS write data to disks?
lonny wrote:
> No performance problems so far; the thumper and ZFS seem to handle
> everything we throw at them. On the T2000 internal disks we were seeing a
> bottleneck when using a single disk for our apps, but moving to a 3-disk
> raidz alleviated that.
>
> The only issue is that when using iostat commands, the bursts make it a
> little harder to gauge performance. Is it safe to assume that if those
> bursts were to reach the upper performance limit, the writes would spread
> out a bit more?

The burst of activity every 5 seconds is when the transaction group is
committed. Batching up the writes in this way can lead to a number of
efficiencies (as Bob hinted). With heavier activity the writes will not get
spread out, but will just take longer. Another way to look at the gaps of IO
inactivity is that they indicate under-utilisation.

Neil.
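If you want to see the commits line up with the bursts, something like this
DTrace one-liner should do it (a sketch; it assumes the fbt provider can see
spa_sync(), the routine that writes each transaction group out to disk):

    # print a line each time a transaction group sync starts
    dtrace -qn 'fbt::spa_sync:entry { printf("%Y  txg %d syncing\n", walltimestamp, arg1); }'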
> The only issue is that when using iostat commands, the bursts make it a
> little harder to gauge performance. Is it safe to assume that if those
> bursts were to reach the upper performance limit, the writes would spread
> out a bit more?

I think it's also important to note _how_ one measures performance (which is
black magic at the best of times). I personally like to see averages, since
doing

#iostat -xnz 10

doesn't tell me anything really. Since zfs likes to "bundle and flush", I
want my (very expensive ;) Sun storage to give me all it's got. I'm not too
concerned if a 5-second flush gives the disk subsystem a good workout, but
when I/O utilization is around 100% with service times of 30+ ms over a
period of an hour... then I might want to wheel the drawing board into the
architect's office.

My 2c :)
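For example (a sketch; the intervals and counts are arbitrary), short samples
show the bursts while a long interval averages them away:

    # ten one-second samples: write bandwidth bounces between idle and a burst
    iostat -xnz 1 10

    # five one-minute samples: the bursts blend into the sustained average
    iostat -xnz 60 5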
Hello James,

Thursday, May 10, 2007, 11:12:57 PM, you wrote:

> zfs will interpret zero'd sectors as holes, so it won't really write them
> to disk; it just adjusts the file size accordingly.

It does that only with compression turned on.

--
Best regards,
Robert                         mailto:rmilkowski@task.gda.pl
                               http://milek.blogspot.com
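A quick way to see the difference (a sketch; pool and dataset names are
placeholders):

    zfs create tank/nocomp
    zfs create tank/comp
    zfs set compression=on tank/comp

    # write 160MB of zeros into each dataset
    dd if=/dev/zero of=/tank/nocomp/zeros bs=1024k count=160
    dd if=/dev/zero of=/tank/comp/zeros bs=1024k count=160

    # both files report the same logical size...
    ls -l /tank/nocomp/zeros /tank/comp/zeros
    # ...but only the uncompressed copy actually consumes ~160MB on disk
    du -k /tank/nocomp/zeros /tank/comp/zeros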
Writes to ZFS objects have significant data and meta-data implications, based
on the ZFS copy-on-write implementation. As data is written into a file
object, for example, this update must eventually be written to a new location
on physical disk, and all of the meta-data (from the uberblock down to this
object) must be updated and re-written to a new location as well. While in
cache, the changes to these objects can be consolidated, but once written out
to disk, any further changes would make this recent write obsolete and
require it all to be written once again to yet another new location on the
disk.

Batching transactions for 5 seconds (the trigger discussed in the ZFS
documentation) is essential to limiting the amount of redundant re-writing
that takes place to physical disk. Keeping a disk busy 100% of the time by
writing mostly the same data over and over makes far less sense than
collecting a group of changes in cache and writing them efficiently every
trigger period.

Even with this optimization, our experience with small, sequential writes
(4KB or less) to zvols that have been previously written (to ensure the
mapping of real space on the physical disk) shows bandwidth values that are
less than 10% of comparable larger (128KB or larger) writes. You can see this
behavior dramatically if you compare the amount of host-initiated write data
(front-end data) to the actual amount of IO performed to the physical disks
(both reads and writes) to handle the host's front-end request. For example,
doing sequential 1MB writes to a (previously written) zvol (a simple
catenation of 5 FC drives in a JBOD) and writing 2GB of data induced more
than 4GB of IO to the drives (with smaller write sizes this ratio gets
progressively worse).
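Roughly the kind of comparison involved (a sketch, not the exact test
harness; pool, zvol and device names are placeholders):

    # front-end: 2GB of sequential 1MB writes issued by the host
    dd if=/dev/zero of=/dev/zvol/rdsk/tank/testvol bs=1024k count=2048

    # back-end: watch the pool's member disks while the dd runs; summing the
    # kr/s and kw/s columns over the run gives the physical IO actually done,
    # which can come to well over the 2GB the host wrote
    iostat -xn 5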
Bill Moloney wrote:
> for example, doing sequential 1MB writes to a (previously written) zvol
> (simple catenation of 5 FC drives in a JBOD) and writing 2GB of data
> induced more than 4GB of IO to the drives (with smaller write sizes this
> ratio gets progressively worse)

How did you measure this? This would imply that rewriting a zvol would be
limited to below 50% of disk bandwidth, not something I'm seeing at all.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com       http://blogs.sun.com/barts
Robert Milkowski
2007-May-16 22:18 UTC
[zfs-discuss] Re: How does ZFS write data to disks?
Hello Bart,

Wednesday, May 16, 2007, 6:07:36 PM, you wrote:

BS> Bill Moloney wrote:
>> for example, doing sequential 1MB writes to a (previously written) zvol
>> (simple catenation of 5 FC drives in a JBOD) and writing 2GB of data
>> induced more than 4GB of IO to the drives (with smaller write sizes this
>> ratio gets progressively worse)

BS> How did you measure this? This would imply that rewriting
BS> a zvol would be limited to below 50% of disk bandwidth, not
BS> something I'm seeing at all.

Perhaps the zvol was created with the default 128k block size and smaller
writes were then issued. Perhaps lowering volblocksize to 8k, or to whatever
average (or constant?) IO size he is using, would help?

--
Best regards,
Robert                         mailto:rmilkowski at task.gda.pl
                               http://milek.blogspot.com
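For reference (pool and volume names are placeholders), volblocksize can only
be set when the zvol is created, via -b:

    # create a 10GB zvol with an 8k block size to match the workload
    zfs create -b 8k -V 10g tank/vol8k

    # confirm the setting
    zfs get volblocksize tank/vol8k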
Bill Moloney
2007-May-17 18:21 UTC
[zfs-discuss] Re: Re[2]: Re: How does ZFS write data to disks?
This is not a problem we're trying to solve, but part of a characterization
study of the ZFS implementation. We're currently using the default 8KB
blocksize for our zvol deployment, and we're performing tests using write
block sizes as small as 4KB and as large as 1MB, as previously described
(including an 8KB write aligned to logical zvol block zero, for a perfect
match to the zvol blocksize). In all cases we see at least twice the IO to
the disks that we generate from our test program (and it's much worse for
smaller write block sizes).

We're not exactly caught in read-modify-write hell (except when we write the
4KB blocks that are smaller than the zvol blocksize); it's more like
modify-write hell, since the original meta-data that maps the 2GB region
we're writing is probably just read once and kept in cache for the duration
of the test. The large amount of back-end IO is almost entirely write
operations, but these write operations include the re-writing of meta-data
that has to change to reflect the re-location of newly written data
(remember, no in-place writes ever occur for data or meta-data).

Using the default zvol block size of 8KB, ZFS requires, in just block-pointer
meta-data, about 1.5% of the total 2GB write region (this is a large
percentage vs other file systems like UFS, for example, because ZFS uses a
128-byte block pointer vs a UFS 8-byte block pointer).

As new data is written over the old data, the leaves of the meta-data tree
are necessarily changed to point to the new on-disk locations of the new
data, but any new leaf block-pointer requires that a new block of leaf
pointers be allocated and written, which requires that the next indirect
level up from these leaves point to this new set of leaf pointers, so it must
be rewritten itself, and so on up the tree (and remember, meta-data is
subject to being written in up to 3 copies - the default is 2 - any time any
of it is written to disk).

The indirect pointer blocks closer to the root of the tree may only see a
single pointer change over the course of a 5-second consolidation (based on
the size of the zvol, the size of the block allocation unit in the zvol and
the amount of data actually written to the zvol in 5 seconds), but a complete
new indirect block must be created and written to disk (all the way back to
the uberblock) on each transaction group write. This means that some of these
meta-data blocks are written to disk over and over again with only small
changes from their previous composition. Consolidating for more than 5
seconds would help to mitigate this situation, but longer consolidation
periods put more data at risk of being lost in case of a power failure.

This is not particularly a problem, just a manifestation of the need to never
write in place, a rather large block pointer size, and the possible writing
of multiple copies of meta-data (of course this block pointer carries
checksums and the addresses of up to 3 duplicate blocks, providing the
excellent data and meta-data protection ZFS is so well known for). The
original thread that this reply addressed was the characteristic 5-second
delay in writes, which I tried to explain in the context of copy-on-write
consolidation, but it's clear that even this delay cannot prevent the
modification and re-writing of the same basic meta-data many times with small
modifications.
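For what it's worth, the ~1.5% figure falls straight out of the block counts
(a back-of-the-envelope check that counts only the leaf block pointers,
ignoring the higher indirect levels and the ditto copies):

    # 2 GiB region divided into 8 KiB blocks = number of leaf block pointers
    echo $((2 * 1024 * 1024 * 1024 / 8192))    # 262144 blocks
    # 262144 pointers at 128 bytes each = the leaf block-pointer meta-data
    echo $((262144 * 128))                     # 33554432 bytes (32 MiB)
    # 32 MiB of pointers / 2048 MiB of data is roughly 1.6% of the region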