Guys,

Have an OpenSolaris x86 box running:

SunOS thsudfile01 5.11 snv_111b i86pc i386 i86pc Solaris

This has 2 old qla2200 1Gbit FC cards attached. Each bus is connected to an old transtec F/C RAID array. This has a couple of large LUNs that form a single large zpool:

root at thsudfile01:~# zpool status bucket
  pool: bucket
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bucket      ONLINE       0     0     0
          c5t0d0    ONLINE       0     0     0
          c8t3d0    ONLINE       0     0     0

errors: No known data errors

root at thsudfile01:~# zfs list bucket
NAME     USED  AVAIL  REFER  MOUNTPOINT
bucket  2.69T  5.31T    22K  /bucket

This is being used as an iSCSI target for an ESX 4.0 development environment. I found the performance to be really poor, and the culprit seems to be write performance to the raw zvol. For example, on this zfs filesystem allocated as a volume:

root at thsudfile01:~# zfs list bucket/iSCSI/lun1
NAME               USED  AVAIL  REFER  MOUNTPOINT
bucket/iSCSI/lun1  250G  5.55T  3.64G  -

root at thsudfile01:~# dd if=/dev/zero of=/dev/zvol/rdsk/bucket/iSCSI/lun1 bs=65536 count=102400
^C7729+0 records in
7729+0 records out
506527744 bytes (507 MB) copied, 241.707 s, 2.1 MB/s

Some zpool iostat 1 1000:

bucket      2.44T  5.68T      0    203      0  2.73M
bucket      2.44T  5.68T      0    216      0  2.83M
bucket      2.44T  5.68T      0    120  63.4K  1.58M
bucket      2.44T  5.68T      2    350   190K  16.9M
bucket      2.44T  5.68T      0    123      0  1.64M
bucket      2.44T  5.68T      0    230      0  3.02M

Read performance from that zvol (assuming /dev/null behaves properly) is fine:

root at thsudfile01:/bucket/transtec# dd of=/dev/null if=/dev/zvol/rdsk/bucket/iSCSI/lun1 bs=65536 count=204800
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 47.0256 s, 285 MB/s

Somewhat optimistic, that... but iostat shows 100MB/s-ish.

Writing to a zfs filesystem from that zpool is also fine, here with a write big enough to exhaust the machine's 12GB of memory:

root at thsudfile01:/bucket/transtec# dd if=/dev/zero of=FILE bs=65536 count=409600
^C
336645+0 records in
336645+0 records out
22062366720 bytes (22 GB) copied, 176.369 s, 125 MB/s

and bursts of cache flush from iostat:

bucket      2.44T  5.68T      0    342      0  38.7M
bucket      2.44T  5.68T      0  1.47K      0   188M
bucket      2.44T  5.68T      0    240      0  21.3M
bucket      2.44T  5.68T      0  1.54K      0   191M
bucket      2.44T  5.68T      0  1.49K      0   191M
bucket      2.44T  5.68T      0    434      0  44.2M

So we seem to be able to get data down to disk via the cache at a reasonable rate, and read from a raw zvol OK, but writes are horribly slow. Am I missing something obvious? Let me know what info would be diagnostic and I'll post it...

Cheers,
Leon
--
This message posted from opensolaris.org
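One quick way to gather some of that diagnostic info is to watch per-device service times while the slow dd is running (a rough sketch, standard Solaris iostat flags; the interesting devices are the two LUNs from the zpool status above):

# iostat -xnz 1

High asvc_t or %b on c5t0d0 / c8t3d0 during the slow zvol writes would point at the arrays or the FC path; low, mostly idle devices would point back at how the writes reach ZFS.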
Do you have any older benchmarks on these cards and arrays (from their pre-ZFS life)? Perhaps this is not a ZFS regression but a hardware config issue? Perhaps some caching (like a per-disk write cache) is not enabled on the arrays? As you may know, the ability (and reliability) of such a cache often depends on the presence of battery backup (BBU) modules on the array controller.

It is also often pointed out that ZFS prefers raw disks to hardware vendors' RAID arrays (with errors possibly creeping into the data, undetected by the firmware's algorithms unless a whole disk breaks, or - in the more elaborate conspiracy theories - errors caused by bugs in the firmware itself). Because of this, if external RAID arrays are used as vdevs, it is often recommended to combine them into a redundant ZFS pool (such as a mirror of two equivalent arrays) so that ZFS can detect and repair such errors itself.

//Jim
--
This message posted from opensolaris.org
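In the absence of old numbers, a rough baseline for the FC path and arrays could be a raw read straight off one of the LUNs (read-only, so it does not touch the pool's data; the device name is only a guess based on the zpool status above, so double-check it first):

# dd if=/dev/rdsk/c5t0d0s0 of=/dev/null bs=1048576 count=1024

If that also crawls, the cards or arrays are suspect; if it runs near wire speed, the problem is more likely in how the zvol writes are handled. A comparable write baseline would need a scratch LUN outside the pool, since writing to these raw devices would corrupt it.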
Hmm, scratch that. Maybe. I did not at first get the point that your writes to a filesystem dataset work quickly. Perhaps the filesystem is (better) cached indeed, i.e. *maybe* zvol writes are synchronous while zfs filesystem writes are cached and thus async? Try playing around with the relevant dataset attributes (one way to inspect them is sketched further below)...

I'm running a test on my system (a snv_114 Thumper, 16Gb RAM, used for other purposes as well); the CPU is mostly idle now (2.5-3.2% kernel time, that's about it). It seems I have results not unlike yours. Not cool, because I wanted to play with COMSTAR iSCSI - and I'm not sure it will perform well ;)

I'm dd'ing 30Gb to an uncompressed test zvol with the same 64kb block size (maybe it is too small?), and zpool iostat goes like this - a hundred IOs at 7MBps for a minute, then a burst of 100-170MBps and 20-25K IOps for a second:

pond        5.79T  4.41T      0    106      0  7.09M
pond        5.79T  4.41T      0  1.93K      0  20.7M
pond        5.79T  4.41T      0  13.3K      0   106M
pond        5.79T  4.41T      0    116      0  7.76M
pond        5.79T  4.41T      0    108      0  7.23M
pond        5.79T  4.41T      0    107      0  7.16M
pond        5.79T  4.41T      0    107      0  7.16M

or

pond        5.79T  4.41T      0    117      0  7.83M
pond        5.79T  4.41T      0  5.61K      0  49.7M
pond        5.79T  4.41T      0  19.0K    504   149M
pond        5.79T  4.41T      0    104      0  6.96M

Weird indeed. It wrote 10Gb (according to "zfs get usedbydataset pond/test") in roughly 30 minutes, after which I killed it.

Now, writing to an uncompressed filesystem dataset (although very far from what's trumpeted as Thumper performance) yields quite different numbers:

pond        5.80T  4.40T      1  3.64K   1022   457M
pond        5.80T  4.40T      0    866    967  75.7M
pond        5.80T  4.40T      0  4.65K      0   586M
pond        5.80T  4.40T      6    802  33.4K  69.2M
pond        5.80T  4.40T     29  2.44K  1.10M   301M
pond        5.80T  4.40T     32    691   735K  25.0M
pond        5.80T  4.40T     56  1.59K  2.29M   184M
pond        5.80T  4.40T    150    768  4.61M  10.5M
pond        5.80T  4.40T      2      0  25.5K      0
pond        5.80T  4.40T      0  2.75K      0   341M
pond        5.80T  4.40T      7  3.96K   339K   497M
pond        5.80T  4.39T     85    740  3.57M  59.0M
pond        5.80T  4.39T     67      0  2.22M      0
pond        5.80T  4.39T      9  4.67K   292K   581M
pond        5.80T  4.39T      4  1.07K   126K   137M
pond        5.80T  4.39T     27    333   338K  9.15M
pond        5.80T  4.39T      5      0  28.0K  3.99K
pond        5.82T  4.37T      1  5.42K  1.67K   677M
pond        5.83T  4.37T      3  1.69K  8.36K   173M
pond        5.83T  4.37T      2      0  5.49K      0
pond        5.83T  4.37T      0  6.32K      0   790M
pond        5.83T  4.37T      2    290  7.95K  27.8M
pond        5.83T  4.37T      0  9.64K  1.23K  1.18G

The numbers are jumpy (maybe due to fragmentation, other processes, etc.) but there are often spikes in excess of 500MBps. The whole test took relatively little time:

# time dd if=/dev/zero of=/pond/tmpnocompress/test30g bs=65536 count=500000
500000+0 records in
500000+0 records out

real    1m27.657s
user    0m0.302s
sys     0m46.976s

# du -hs /pond/tmpnocompress/test30g
 30G   /pond/tmpnocompress/test30g

To give some detail about the pool: it is on a Sun X4500 with 48 250Gb SATA drives. It was created as a 9x5 set (9 stripes made of 5-disk raidz1 vdevs) spread across different controllers, with the command:

# zpool create -f pond \
    raidz1 c0t0d0 c1t0d0 c4t0d0 c6t0d0 c7t0d0 \
    raidz1 c0t1d0 c1t2d0 c4t3d0 c6t5d0 c7t6d0 \
    raidz1 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
    raidz1 c0t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
    raidz1 c0t3d0 c1t3d0 c5t3d0 c6t3d0 c7t3d0 \
    raidz1 c0t4d0 c1t4d0 c4t4d0 c6t4d0 c7t4d0 \
    raidz1 c0t5d0 c1t5d0 c4t5d0 c5t5d0 c7t5d0 \
    raidz1 c0t6d0 c1t6d0 c4t6d0 c5t6d0 c6t6d0 \
    raidz1 c1t7d0 c4t7d0 c5t7d0 c6t7d0 c7t7d0 \
    spare c0t7d0

Alas, while there were many blogs, I couldn't find a definitive answer last year as to which Thumper layout is optimal in performance and/or reliability (with regard to its 6 controllers of 8 disks each, with 2 disks on one of the controllers reserved for booting).
As a result, we spread each raidz1 across 5 controllers, so the loss of one controller should, on average, have minimal impact in terms of data loss. Since the system layout is not symmetrical, some controllers are more important than others (say, the one with the boot disks).

//Jim
--
This message posted from opensolaris.org
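Coming back to "playing with the relevant dataset attributes": a minimal way to compare what the zvol and the filesystem dataset are actually set to (the property list here is just an example; "zfs get all" shows everything a given build supports):

# zfs get volblocksize,compression,checksum pond/test
# zfs get recordsize,compression,checksum pond/tmpnocompress

Any difference there, or in whatever extra caching-related properties your build exposes, would be a first hint as to why the two behave so differently.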
To tell the truth, I expected zvols to be faster than filesystem datasets. They seem to have less overhead, with no inodes, POSIX semantics, ACLs and so on. So I'm puzzled by the test results.

I'm now looking at the dd I/O block size, and it matters a lot indeed, especially compared to the zvol results with small blocks like 64k.

I ran a number of tests with a zvol recreated by these commands before each run (this may, however, cause varying fragmentation, affecting results across runs):

# zfs destroy -r pond/test; zfs create -V 30G pond/test; zfs set compression=off pond/test; sync; dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512; sync

and tests going like:

# time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1024 bs=1048576
1024+0 records in
1024+0 records out

real    0m42.442s
user    0m0.006s
sys     0m4.292s

The test progress was quite jumpy (with "zpool iostat pond 1" values varying from 30 to 70 MBps, reads coming in sometimes). So I'd stick to the overall result - the rounded wallclock time it takes to write 1024 records of varying size, and the resulting average end-user MBps. I also note the "sys" time, since that's what is consumed by the kernel and the disk subsystem, after all. I don't quote zpool iostat speeds, since they vary too much and I don't bother with a spreadsheet right now; but the reported values stay about halfway between the "wallclock MBps" and "sys MBps" calculations on the perceived average, peaking at about 350MBps for large block sizes (>4MB).

1 MB (bs=1048576): 42s (4s), 24MBps
4 MB (bs=4194304): 42s (15s), 96MBps
16MB (bs=16777216): 129s-148s (62-64s), 127-110MBps
32MB (bs=33554432, 40Gb zvol): 303s (127s), 108MBps

The picture was similarly jumpy when writing a file to a filesystem dataset; "zpool iostat" values again jumped anywhere between single MBps and GBps. Simple cleanups were used, like:

# rm /pool/test30g; sync; time dd if=/dev/zero of=/pool/test30g count=1024 bs=33554432

The values remain somewhat consistent (in the same league, at least):

1 MB (bs=1048576, 10240 blocks): 20-21s (7-8s), 512-487MBps
1 MB (bs=1048576): 2.3s (0.6s), 445MBps
4 MB (bs=4194304): 8s (3s), 512MBps
16MB (bs=16777216): 37s (15s), 442MBps
32MB (bs=33554432): 74-103s (32-42s), 442-318MBps
64Kb (bs=65536, 545664 blocks): 94s (47s), 362MBps

All in all, to make the results more precise, these tests should be done in greater numbers and averaged. But here we have some figures to think about...

On a side note, I'll now pay more attention to tuning suggestions which involve multi-megabyte buffers for network sockets, etc. They can actually change performance many times over!

On another note, for some reason I occasionally got results like this:

write: File too large
1+0 records in
1+0 records out

I think the zvol was not yet fully created at that point. In about 10-15 sec I was able to commence the test run. Perhaps it helped that I "initialized" the zvol with a small write after creation, i.e.:

# dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512

Strange...
--
This message posted from opensolaris.org
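If that "File too large" oddity really is the /dev/zvol node not being ready yet, a small guard before the timed run might sidestep it (a sketch only, using the same path as in the tests above):

# zfs create -V 30G pond/test
# until [ -c /dev/zvol/rdsk/pond/test ]; do sleep 1; done

That just polls until the character device node exists, instead of relying on a fixed 10-15 second pause or a throwaway initial write.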
On Jul 9, 2009, at 4:22 AM, Jim Klimov <no-reply at opensolaris.org> wrote:

> To tell the truth, I expected zvols to be faster than filesystem datasets.
> [...]
> I'm now looking at the dd I/O block size, and it matters a lot indeed,
> especially compared to the zvol results with small blocks like 64k.
> [...]
> All in all, to make the results more precise, these tests should be done
> in greater numbers and averaged. But here we have some figures to think about...

When running throughput tests, the block sizes to be concerned about are: 4k, 8k, 16k and 64k. These are the sizes that most filesystems and databases use. If you get 4k to perform well, chances are the others will fall into line.

-Ross
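Tying Ross's list of sizes to Jim's earlier point about running more samples and averaging, a rough harness might look like the sketch below (the zvol path and total size are only placeholders, reusing the rdsk path from the earlier tests; it keeps the total write at about 1GB per block size and repeats each run three times):

#!/bin/sh
# Rough throughput sweep over 4k/8k/16k/64k with 3 repeats per size,
# so the wallclock and sys times can be averaged afterwards.
ZVOL=/dev/zvol/rdsk/pond/test          # placeholder - adjust before running
TOTAL=1073741824                       # ~1GB written per run
for bs in 4096 8192 16384 65536; do
    count=`expr $TOTAL / $bs`
    run=1
    while [ $run -le 3 ]; do
        echo "bs=$bs run=$run"
        time dd if=/dev/zero of=$ZVOL bs=$bs count=$count
        run=`expr $run + 1`
    done
done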
After reading many, many threads on ZFS performance today (top of the list in the forum, and some chains of references), I applied a bit of tuning to the server. In particular, I've set zfs_write_limit_override to 384Mb so my cache is spooled to disk more frequently (when streaming lots of writes) and in smaller increments. In the live kernel:

# echo zfs_write_limit_override/W0t402653184 | mdb -kw

which corresponds to the /etc/system setting:

set zfs:zfs_write_limit_override = 0x18000000

The system seems to be working more smoothly (vs. jerky), and "zpool iostat" values are not quite as jumpy (i.e. 320MBps to 360MBps for a certain test). The results also seem faster and more consistent.

With this tuning applied, I'm writing to a 40G zvol, 1M records (count=1048576) of:

4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 54s (30s), 303MBps
32k (bs=32768): 113s (56s), 290MBps
64k (bs=65536): 269s (104s), 243MBps

And 10240 larger records of:

1 MB (bs=1048576): 33s (8s), 310MBps
2 MB (bs=2097152): 74s (23s), 276MBps

And 1024 yet larger records:

1 MB (bs=1048576): 4s (1s), 256MBps
4 MB (bs=4194304): 12s (5s), 341MBps
16MB (bs=16777216): 71s (18s), 230MBps
32MB (bs=33554432): 150s (36s), 218MBps

So the zvol picture is quite a bit better now (albeit not perfect - i.e. no values are near the 1GBps noted previously in "zpool iostat"), for both small and large blocks.

For the filesystem dataset the new values are very similar (like, to tenths of a second on the smaller block sizes!), but as the block size grows, filesystems start losing to the zvols. Overall the result seems lower than achieved before I tried tuning.

1M records (count=1048576) of:

4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 67s (30s), 245MBps
32k (bs=32768): 144s (55s), 228MBps
64k (bs=65536): 275s (98s), 238MBps

And 10240 larger records go better:

1 MB (bs=1048576): 33s (9s), 310MBps
2 MB (bs=2097152): 70s (21s), 292MBps

And 1024 yet larger records:

1 MB (bs=1048576): 2.8s (0.8s), 366MBps
4 MB (bs=4194304): 12s (4s), 341MBps
16MB (bs=16777216): 55s (17s), 298MBps
32MB (bs=33554432): 140s (36s), 234MBps

Occasionally I did reruns; the wall-clock time for the same setup can vary significantly (like 65s vs 84s) while the system time stays pretty much the same. "zpool iostat" shows larger values (like 320MBps typically), but I think that can be attributed to writing parity stripes on the raidz vdevs.

//Jim

PS: for completeness, I'll try smaller blocks without tuning in a future post.
--
This message posted from opensolaris.org
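A couple of companion commands for anyone repeating this, using the same mdb mechanism as above (the read assumes the variable is 64-bit on this build; setting it back to 0 restores the default behaviour of sizing the write limit dynamically):

# echo zfs_write_limit_override/E | mdb -k          (show the current value)
# echo zfs_write_limit_override/W0t0 | mdb -kw      (revert to the default)

The mdb write only affects the running kernel; it is the /etc/system line that makes the change persist across reboots.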
Writes using the character interface (/dev/zvol/rdsk) are synchronous. If you want caching, you can go through the block interface (/dev/zvol/dsk) instead.

- Eric
--
This message posted from opensolaris.org
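As a quick check of Eric's point, the dd test from the original post could be repeated against the block interface of the same zvol (same path, just dsk instead of rdsk):

# dd if=/dev/zero of=/dev/zvol/dsk/bucket/iSCSI/lun1 bs=65536 count=102400

If the synchronous character device really is the culprit, this run should come out much closer to the filesystem-dataset speeds - though what the iSCSI target layer then does with the zvol is a separate question.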