Can someone please describe to me the actual underlying I/O operations
which occur when a 128K block of data is written to a storage pool
configured as shown below (with default ZFS block sizes)? I am
particularly interested in the degree of "striping" across mirrors which
occurs. This would be for Solaris 10 U4.

        NAME                                       STATE     READ WRITE CKSUM
        Sun_2540                                   ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096A47B4559Ed0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA047B4529Bd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096E47B456DAd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA447B4544Fd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096147B451BEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AA847B45605d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000096647B453CEd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AAC47B45739d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B80003A8A0B0000097347B457D4d0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB047B457ADd0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600A0B800039C9B500000A9C47B4522Dd0  ONLINE       0     0     0
            c4t600A0B800039C9B500000AB447B4595Fd0  ONLINE       0     0     0

Thanks,

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:
> Can someone please describe to me the actual underlying I/O operations
> which occur when a 128K block of data is written to a storage pool
> configured as shown below (with default ZFS block sizes)? I am
> particularly interested in the degree of "striping" across mirrors
> which occurs. This would be for Solaris 10 U4.

My observation is that each metaslab is, by default, 1 MByte in size.
Space on each top-level vdev is allocated in metaslabs, and ZFS tries to
fill a top-level vdev's metaslab before moving on to another one. So you
should see eight 128-kByte allocations per top-level vdev before
allocation moves on to the next top-level vdev.

That said, the actual iops are sent in parallel, so it is not unusual to
see many, most, or all of the top-level vdevs concurrently busy.

Does this match your experience?
 -- richard
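One way to watch this allocation pattern for yourself (a rough sketch,
using the pool name from the question; the 5-second interval and the
DTrace one-liner are just illustrative) is to sample per-vdev activity
while the write load runs and to histogram the physical I/O sizes issued
to each disk:

    # per-vdev and per-disk bandwidth, sampled every 5 seconds
    zpool iostat -v Sun_2540 5

    # distribution of physical I/O sizes, per device
    dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

If whole 128K blocks are being handed to each mirror in turn, the size
histogram should cluster around 128K rather than at smaller "stripe"
fragments.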
On Sat, 15 Mar 2008, Richard Elling wrote:
> My observation is that each metaslab is, by default, 1 MByte in size.
> [...]
> That said, the actual iops are sent in parallel, so it is not unusual
> to see many, most, or all of the top-level vdevs concurrently busy.
>
> Does this match your experience?

I do see that all the devices are quite evenly busy. There is no doubt
that the load balancing is quite good. The main question is whether there
is any actual "striping" going on (breaking the data into smaller
chunks), or whether the algorithm is simply load balancing. Striping
trades IOPS for bandwidth.

Using my application, I did some tests today. The application was used to
do balanced read/write of about 500GB of data in some tens of thousands
of reasonably large files. The application sequentially reads a file,
then sequentially writes a file. Several copies (2-6) of the application
were run at once for concurrency. What I noticed is that with hardly any
CPU being used, the read+write bandwidth seemed to be bottlenecked at
about 280MB/second, with 'zpool iostat' showing very balanced I/O between
the reads and the writes.

The system I set up is performing quite a bit differently than I
anticipated. The I/O is bottlenecked, and I find that my application can
do significant processing of the data without significantly increasing
the application run time. So CPU time is almost free.

If I was to assign a smaller block size for the filesystem, would that
provide more of the benefits of striping, or would it be detrimental to
performance due to the number of I/Os?

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> I do see that all the devices are quite evenly busy. There is no
> doubt that the load balancing is quite good. The main question is
> whether there is any actual "striping" going on (breaking the data
> into smaller chunks), or whether the algorithm is simply load
> balancing. Striping trades IOPS for bandwidth.

There's no striping across vdevs going on. It's simple load balancing,
i.e. blocks are spread semi-randomly across the vdevs. I say "semi"
because apparently the average bandwidth load of the vdevs influences the
outcome.

-mg
Bob Friesenhahn wrote:
> On Sat, 15 Mar 2008, Richard Elling wrote:
>> [...]
>> Does this match your experience?
>
> I do see that all the devices are quite evenly busy. There is no
> doubt that the load balancing is quite good. The main question is
> whether there is any actual "striping" going on (breaking the data
> into smaller chunks), or whether the algorithm is simply load
> balancing. Striping trades IOPS for bandwidth.

By my definition of striping, yes, it is going on. But there are
different ways to spread the data. The way that writes are handled, ZFS
rewards devices which can provide good sequential write bandwidth, like
disks. Reads are another story: they read from where the data is, which
in turn depends on the conditions at write time.

The other behaviour you may see is that reads and writes are coalesced,
when possible. At the device level you may see your smaller blocks being
coalesced into larger iops.

> Using my application, I did some tests today. The application was
> used to do balanced read/write of about 500GB of data in some tens of
> thousands of reasonably large files. The application sequentially
> reads a file, then sequentially writes a file. Several copies (2-6)
> of the application were run at once for concurrency. What I noticed
> is that with hardly any CPU being used, the read+write bandwidth
> seemed to be bottlenecked at about 280MB/second, with 'zpool iostat'
> showing very balanced I/O between the reads and the writes.

But where is the bottleneck? iostat will show bottlenecks in the physical
disks and channels. vmstat or mpstat will show the bottlenecks in cpus.
To see if the app is the bottleneck will require some analysis of the app
itself. Is it spending its time blocked on I/O?

> The system I set up is performing quite a bit differently than I
> anticipated. The I/O is bottlenecked, and I find that my application
> can do significant processing of the data without significantly
> increasing the application run time. So CPU time is almost free.
>
> If I was to assign a smaller block size for the filesystem, would that
> provide more of the benefits of striping, or would it be detrimental
> to performance due to the number of I/Os?

I would not expect to see much difference, but the proof is in the
pudding. Let us know what you find.
 -- richard
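For reference, a minimal triage along the lines Richard suggests (the
5-second interval is only an example) might look like:

    iostat -xnz 5    # physical disks/channels: look for %b near 100 or large wait queues
    mpstat 5         # per-CPU load: look for usr+sys near 100 on any CPU
    vmstat 5         # paging and run-queue pressure

If the disks show only moderate %b and the CPUs are mostly idle, the
limit is more likely the per-spindle I/O pattern than raw channel or CPU
capacity.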
On Sun, 16 Mar 2008, Richard Elling wrote:
> But where is the bottleneck? iostat will show bottlenecks in the
> physical disks and channels. vmstat or mpstat will show the
> bottlenecks in cpus. To see if the app is the bottleneck will
> require some analysis of the app itself. Is it spending its time
> blocked on I/O?

The application is spending almost all the time blocked on I/O. I see
that the number of device writes per second seems pretty high. The
application is doing I/O in 128K blocks. How many IOPS does a modern
300GB 15K RPM SAS drive typically deliver? Of course the IOPS capacity
depends on whether the access is random or sequential. At the application
level, the access is completely sequential, but ZFS is likely doing some
extra seeks.

iostat output (atime=off):

                     extended device statistics
    device   r/s    w/s   Mr/s   Mw/s  wait  actv  svc_t  %w  %b
    sd0      0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
    sd1      0.0    0.0    0.0    0.0   0.0   0.0    2.8   0   0
    sd2      0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
    sd10    80.4  170.7   10.0   19.9   0.0   9.2   36.5   0  54
    sd11    82.1  170.2   10.2   20.0   0.0  13.3   52.9   0  71
    sd12    79.3  168.3    9.9   20.0   0.0  13.1   53.1   0  69
    sd13    80.6  173.0   10.0   19.9   0.0   9.3   36.7   0  56
    sd14    80.9  167.8   10.1   20.0   0.0  13.4   53.8   0  70
    sd15    77.7  168.7    9.7   19.9   0.0   9.1   37.1   0  52
    sd16    77.3  170.6    9.6   20.0   0.0  13.3   53.7   0  70
    sd17    76.4  168.2    9.5   20.0   0.0   9.1   37.2   0  52
    sd18    76.7  172.2    9.5   19.9   0.0  13.5   54.2   0  70
    sd19    83.8  173.2   10.4   20.0   0.0  13.7   53.4   0  74
    sd20    73.3  174.3    9.1   20.0   0.0   9.1   36.9   0  56
    sd21    75.3  170.2    9.4   20.0   0.0  13.2   53.9   0  69
    nfs1     0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0

% mpstat

    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
      0  288   1  189  1018  413  815   26  102   88    0  3046    3   3   0  94
      1  185   1  180   634    1  830   43  111   74    0  3117    3   2   0  94
      2  284   1  183   521    6  617   27   98   67    0  4954    4   3   0  93
      3  176   1  239   748  353  555   25   76   62    0  3933    4   3   0  93

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
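As a rough ballpark (typical vendor figures, not numbers from this
thread), a 15K RPM SAS drive works out to something like:

    rotational latency (average)  = 0.5 / (15000/60) rev/s  ~= 2.0 ms
    average seek (15K SAS)        ~= 3.5 ms
    random IOPS per drive         ~= 1 / (2.0 ms + 3.5 ms)  ~= 170-200
    sequential 128K transfers     limited by media rate (~80-120 MB/s),
                                  i.e. roughly 600-900 IOPS of 128K each

The iostat sample above shows roughly 250 ops/s per disk at about 30
MB/s, which sits between the purely random and purely sequential regimes
and is consistent with mostly sequential transfers interrupted by short
seeks.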
Hi Bob ... as Richard has mentioned, allocation to vdevs is done in a
fixed-sized chunk (Richard specs 1MB, but I remember a 512KB number from
the original spec; this is not very important), and the allocation
algorithm is basically doing load balancing. For your non-raid pool, this
chunk size will stay fixed regardless of the block size you choose when
creating the file system or the IO unit size your application(s) use.
(The stripe size can dynamically change in a raidz pool, but not in your
non-raid pool.)

Measuring bandwidth for your application load is tricky with ZFS, since
there are many hidden IO operations (besides the ones that your
application is requesting) that ZFS must perform. If you collect iostats
on bytes transferred to the hard drives and compare those numbers to the
amount of data your application(s) transferred, you can find potentially
large differences. The differences in these scenarios are largely driven
by the IO size your application(s) use. For example, when I run the
following tests, here are my observations:

- using a dual Xeon server with a QLogic FC 2G interface
- using a pool with 5 10Krpm FC 146GB drives
- sequentially writing 4 15GB previously written files in one file
  system in the pool (this file system is using the 128KB block size),
  with a separate thread writing each file concurrently, for a total of
  60GB written

    block size   written   actual disk IO   observed BW (MB/s)   %CPU
    ----------   -------   --------------   ------------------   ----
    4KB          60GB      227.3GB          34.2                 20.4
    32KB         60GB      216.5GB          36.1                 13.9
    128KB        60GB      63.6GB           69.6                 31.0

You can see that a small application IO size causes much meta-data-based
IO (more than 3 times the actual application IO requirements), while the
128KB application writes induce only marginally more disk IO than the
application actually uses. The BW numbers here are for just the
application data, but when you consider all the IO from the disks over
the test times, the physical BW is obviously greater in all cases. All my
drives were uniformly busy in these tests, but the small application IO
sizes forced much more total IO against the drives.

In your case the application IO rate would be even further degraded due
to the mirror configuration. The extra load of reading and writing
meta-data (including ditto blocks) and mirror devices conspires to reduce
the application IO rate, even though the disk device IO rates may be
quite good. File system block size reduction only exacerbates the problem
by requiring more meta-data to support the same quantity of application
data, and for sequential IO this is a loser. In any case, for a non-raid
pool, the allocation chunk size per drive (the stripe size) is not
influenced by the file system block size. When application IO sizes get
small, the overhead in ZFS goes up dramatically.

regards, Bill

> The application is spending almost all the time blocked on I/O. I see
> that the number of device writes per second seems pretty high. The
> application is doing I/O in 128K blocks. How many IOPS does a modern
> 300GB 15K RPM SAS drive typically deliver? Of course the IOPS capacity
> depends on whether the access is random or sequential. At the
> application level, the access is completely sequential, but ZFS is
> likely doing some extra seeks.
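A simplified, single-threaded version of one of these runs could be
scripted roughly like this (the pool path, file name, and the use of dd
in place of the real test program are all assumptions, not Bill's actual
setup):

    # capture physical disk traffic while the rewrite is in flight
    iostat -xn 5 > /tmp/iostat_4k.log &

    # sequentially rewrite an existing 15GB file using 4K application writes;
    # conv=notrunc keeps the file in place, so this is an overwrite rather
    # than a new-file write
    dd if=/dev/zero of=/pool/fs/testfile bs=4k count=3932160 conv=notrunc

    kill %1

Comparing the bytes dd reports against the write totals in the iostat log
gives the application-vs-physical ratio shown in the table above.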
On Wed, 19 Mar 2008, Bill Moloney wrote:
> When application IO sizes get small, the overhead in ZFS goes
> up dramatically.

Thanks for the feedback. However, from what I have observed, it is not
the full story at all. On my own system, when a new file is written, the
write block size does not make a significant difference to the write
speed. Similarly, read block size does not make a significant difference
to the sequential read speed. I do see a large difference in rates when
an existing file is updated sequentially. There is a many-orders-of-
magnitude difference for random I/O type updates.

I think that there are some rather obvious reasons for the difference
between writing a new file and updating an existing file. When writing a
new file, the system can buffer up to a disk block's worth of data prior
to issuing a disk I/O, or it can immediately write what it has, and since
the write is sequential, it does not need to re-read prior to writing
(though there may be more metadata I/Os). For the case of updating part
of a disk block, there needs to be a read prior to the write if the block
is not cached in RAM.

If the system is short on RAM, it may be that ZFS issues many more write
I/Os than if it has a lot of RAM.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> On my own system, when a new file is written, the write block size
> does not make a significant difference to the write speed

Yes, I've observed the same result ... when a new file is being written
sequentially, the file data and newly constructed meta-data can be built
in cache and written in large sequential chunks periodically, without the
need to read in existing meta-data and/or data. It seems that data and
meta-data that are newly constructed in cache for sequential operations
will persist in cache effectively, and the application IO size is a much
less sensitive parameter. Monitoring disks with iostat in these cases
shows the disk IO to be only marginally greater than the application IO.
This is why I specified that the write tests described in my previous
post were to existing files.

The overhead of doing small sequential writes to an existing object is so
much greater than writing to a new object that it begs for some
reasonable explanation. The only one that I've been able to assemble
through various experiments is that data/meta-data for existing objects
is not retained effectively in cache if ZFS detects that such an object
is being sequentially written. This forces constant re-reading of the
data/meta-data associated with such an object, causing a huge increase in
device IO traffic that does not seem to accompany the writing of a brand
new object. The size of RAM seems to make little difference in this case.

As small sequential writes accumulate in the 5-second cache, the chain of
meta-data leading to a newly constructed data block may see only one
pointer (of the 128 in the final set) changing to point to that block,
but all the meta-data from the uberblock to the target must be rewritten
on the 5-second flush. Of course this is not much different from what's
happening in the newly-created-object scenario, so it must be the
behavior that follows this flush that's different. It seems to me that
after this flush, some or all of the data/meta-data that will be affected
next is re-read, even though much of what's needed for subsequent
operations should already be in cache.

My experience with large-RAM systems and with the use of SSDs as ZFS
cache devices has convinced me that data/meta-data associated with
sequential write operations to existing objects (and ZFS seems very good
at detecting this association) does not get retained in cache very
effectively. You can see this very clearly if you look at the IO to a
cache device (ZFS allows you to easily attach a device to a pool as a
cache device, which acts as a sort of L2-type cache for RAM). When I do
random IO operations to existing objects, I see a large amount of IO to
my cache device as RAM fills and ZFS pushes cached information (that
would otherwise be evicted) to the SSD cache device. If I repeat the
random IO test over the same total file space, I see improved performance
as I get occasional hits from the RAM cache and the SSD cache. As this
extended cache hierarchy warms up with each test run, my results continue
to improve.

If I run sequential write operations to existing objects, however, I see
very little activity to my SSD cache, and virtually no change in
performance when I immediately run the same test again. It seems that ZFS
is still in need of some fine-tuning for small sequential write
operations to existing objects.

regards, Bill
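For anyone wanting to reproduce the cache-device observation, the
mechanics are roughly as follows (the pool and device names are
hypothetical, and cache vdevs require a ZFS release newer than
Solaris 10 U4):

    # attach an SSD as an L2-style cache device
    zpool add tank cache c5t0d0

    # watch traffic to the cache device alongside the data vdevs
    zpool iostat -v tank 5

Under a random-read workload the cache device line should show steady
write traffic as the ARC spills into it; under the sequential-rewrite
workload described above it stays nearly idle.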
Hello Bob,

Wednesday, March 19, 2008, 11:23:58 PM, you wrote:

BF> [...] For the case of updating part of a disk block, there needs to
BF> be a read prior to the write if the block is not cached in RAM.

Possibly when you created the file, zfs used 128KB blocks. Then if you
randomly update that file, the question is: what is the average update
size? If it's below 128KB (and not aligned), you will basically have to
read the old 128KB block first and then write it to a new location.

In such a scenario (like Oracle databases on zfs), BEFORE you create the
files, set the recordsize property to something smaller, ideally matching
your average update size. In the case of Oracle, matching db_block_size
should give you the best results most of the time.

-- 
Best regards,
 Robert Milkowski                       mailto:milek at task.gda.pl
                                        http://milek.blogspot.com
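A minimal sketch of that setup (the dataset name and the 8K value are
illustrative; recordsize only affects blocks written after it is set, so
it has to be in place before the data files are created):

    zfs create tank/oradata
    zfs set recordsize=8k tank/oradata    # match an 8K Oracle db_block_size
    zfs get recordsize tank/oradata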
> Similarly, read block size does not make a
> significant difference to the sequential read speed.

Last time I did a simple bench using dd, supplying the record size as the
blocksize (instead of no blocksize parameter) bumped the mirror pool
speed from 90MB/s to 130MB/s.

-mg
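Presumably the comparison was something along these lines (the file path
is hypothetical; 128k matches the default ZFS recordsize):

    dd if=/tank/fs/bigfile of=/dev/null            # default 512-byte reads
    dd if=/tank/fs/bigfile of=/dev/null bs=128k    # reads matched to the recordsize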
On Thu, 20 Mar 2008, Mario Goebbels wrote:
>> Similarly, read block size does not make a
>> significant difference to the sequential read speed.
>
> Last time I did a simple bench using dd, supplying the record size as
> the blocksize (instead of no blocksize parameter) bumped the mirror
> pool speed from 90MB/s to 130MB/s.

Indeed. However, as an interesting twist to things, in my own benchmark
runs I see two behaviors. When the file size is smaller than the amount
of RAM the ARC can reasonably grow to, the write block size does make a
clear difference. When the file size is larger than RAM, the write block
size no longer makes much difference, and sometimes larger block sizes
actually go slower.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mar 20, 2008, at 11:07 AM, Bob Friesenhahn wrote:
> On Thu, 20 Mar 2008, Mario Goebbels wrote:
>>> Similarly, read block size does not make a
>>> significant difference to the sequential read speed.
>>
>> Last time I did a simple bench using dd, supplying the record size as
>> the blocksize (instead of no blocksize parameter) bumped the mirror
>> pool speed from 90MB/s to 130MB/s.
>
> Indeed. However, as an interesting twist to things, in my own
> benchmark runs I see two behaviors. When the file size is smaller
> than the amount of RAM the ARC can reasonably grow to, the write block
> size does make a clear difference. When the file size is larger than
> RAM, the write block size no longer makes much difference, and
> sometimes larger block sizes actually go slower.

in that case .. try fixing the ARC size .. the dynamic resizing on the
ARC can be less than optimal IMHO

---
.je
On Thu, 20 Mar 2008, Jonathan Edwards wrote:
> in that case .. try fixing the ARC size .. the dynamic resizing on the
> ARC can be less than optimal IMHO

Is a 16GB ARC size not considered to be enough? ;-)

I was only describing the behavior that I observed. It seems to me that
when large files are written very quickly, once the file becomes bigger
than the ARC, what is contained in the ARC is mostly stale and does not
help much any more. If the file is smaller than the ARC, then there is
likely to be more useful caching.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> Is a 16GB ARC size not considered to be enough? ;-)
>
> I was only describing the behavior that I observed. It seems to me
> that when large files are written very quickly, once the file becomes
> bigger than the ARC, what is contained in the ARC is mostly stale and
> does not help much any more. If the file is smaller than the ARC, then
> there is likely to be more useful caching.

That's the problem of grouping writes, since they're going to be buffered
in the ARC. Writing a file larger than the ARC is akin to holding a
powerful firehose on it. :)

I guess what'd help your case would be an option that specifies the
minimum of ARC memory dedicated to read caching only.

-mg
On Mar 20, 2008, at 2:00 PM, Bob Friesenhahn wrote:
> On Thu, 20 Mar 2008, Jonathan Edwards wrote:
>> in that case .. try fixing the ARC size .. the dynamic resizing on
>> the ARC can be less than optimal IMHO
>
> Is a 16GB ARC size not considered to be enough? ;-)
>
> I was only describing the behavior that I observed. It seems to me
> that when large files are written very quickly, once the file becomes
> bigger than the ARC, what is contained in the ARC is mostly stale and
> does not help much any more. If the file is smaller than the ARC, then
> there is likely to be more useful caching.

sure, I got that - it's not the size of the arc in this case, since
caching is going to be a lost cause .. but explicitly setting a
zfs_arc_max should result in fewer calls to arc_shrink() when you hit
memory pressure, with the application's page buffer competing with the
arc

in other words, as soon as the arc is 50% full of dirty pages (8GB) it'll
start evicting pages .. you can't avoid that .. but what you can avoid is
the additional weight of constantly growing and shrinking the cache as it
tries to keep up with your constantly changing blocks in a large file

---
.je
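For Solaris 10, the usual way to pin the ARC is a zfs_arc_max entry in
/etc/system, which takes effect at the next reboot (the 8GB value below
is only an illustration, not a recommendation from this thread):

    * /etc/system: cap the ARC at 8 GB (0x200000000 bytes)
    set zfs:zfs_arc_max = 0x200000000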