Dan Naumov
2010-Jan-24 16:36 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
Note: Since my issue is slow performance right off the bat and not
performance degradation over time, I decided to start a separate
discussion. After installing a fresh pure ZFS 8.0 system and building
all my ports, I decided to do some benchmarking. At this point, about
a dozen ports have been built and installed and the system has been up
for about 11 hours. No heavy background services have been running,
only SSHD and NTPD:

==============================================================
bonnie -s 8192:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 23821 61.7 22311 19.2 13928 13.7 25029 49.6 44806 17.2 135.0  3.1

During the process, top looks like this:

last pid: 83554;  load averages:  0.31,  0.31,  0.37   up 0+10:59:01  17:24:19
33 processes:  2 running, 31 sleeping
CPU:  0.1% user,  0.0% nice, 14.1% system,  0.7% interrupt, 85.2% idle
Mem: 45M Active, 4188K Inact, 568M Wired, 144K Cache, 1345M Free
Swap: 3072M Total, 3072M Free

Oh wow, that looks low. Alright, let's run it again, just to be sure:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 18235 46.7 23137 19.9 13927 13.6 24818 49.3 44919 17.3 134.3  2.1

OK, let's reboot the machine and see what kind of numbers we get on a fresh boot:

==============================================================

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 21041 53.5 22644 19.4 13724 12.8 25321 48.5 43110 14.0 143.2  3.3

Nope, no help from the reboot, still very low speed. Here is my pool:

==============================================================
zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            ad10s1a  ONLINE       0     0     0
            ad8s1a   ONLINE       0     0     0

==============================================================
diskinfo -c -t /dev/ad10
/dev/ad10
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        3876021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCAVY0301430 # Disk ident.

I/O command overhead:
        time to read 10MB block      0.164315 sec  =  0.008 msec/sector
        time to read 20480 sectors   3.030396 sec  =  0.148 msec/sector
        calculated command overhead                =  0.140 msec/sector

Seek times:
        Full stroke:      250 iter in   7.309334 sec =   29.237 msec
        Half stroke:      250 iter in   5.156117 sec =   20.624 msec
        Quarter stroke:   500 iter in   8.147588 sec =   16.295 msec
        Short forward:    400 iter in   2.544309 sec =    6.361 msec
        Short backward:   400 iter in   2.007679 sec =    5.019 msec
        Seq outer:       2048 iter in   0.392994 sec =    0.192 msec
        Seq inner:       2048 iter in   0.332582 sec =    0.162 msec

Transfer rates:
        outside:       102400 kbytes in   1.576734 sec =    64944 kbytes/sec
        middle:        102400 kbytes in   1.381803 sec =    74106 kbytes/sec
        inside:        102400 kbytes in   2.145432 sec =    47729 kbytes/sec

==============================================================
diskinfo -c -t /dev/ad8
/dev/ad8
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        3876021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCAVY1611513 # Disk ident.

I/O command overhead:
        time to read 10MB block      0.176820 sec  =  0.009 msec/sector
        time to read 20480 sectors   2.966564 sec  =  0.145 msec/sector
        calculated command overhead                =  0.136 msec/sector

Seek times:
        Full stroke:      250 iter in   7.993339 sec =   31.973 msec
        Half stroke:      250 iter in   5.944923 sec =   23.780 msec
        Quarter stroke:   500 iter in   9.744406 sec =   19.489 msec
        Short forward:    400 iter in   2.511171 sec =    6.278 msec
        Short backward:   400 iter in   2.233714 sec =    5.584 msec
        Seq outer:       2048 iter in   0.427523 sec =    0.209 msec
        Seq inner:       2048 iter in   0.341185 sec =    0.167 msec

Transfer rates:
        outside:       102400 kbytes in   1.516305 sec =    67533 kbytes/sec
        middle:        102400 kbytes in   1.351877 sec =    75747 kbytes/sec
        inside:        102400 kbytes in   2.090069 sec =    48994 kbytes/sec

==============================================================

The exact same disks, on the exact same machine, are well capable of
65+ MB/s throughput (tested with ATTO multiple times) with different
block sizes using Windows 2008 Server and NTFS. So what would be the
cause of these very low bonnie result numbers in my case? Should I try
some other benchmark, and if so, with what parameters?

- Sincerely,
Dan Naumov
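For anyone reproducing this kind of test, ZFS's own per-pool statistics
are an easy way to watch what the mirror members are doing while bonnie
runs. A minimal sketch, assuming the pool name tank from the zpool
status output above:

    zpool iostat -v tank 1

This prints pool-wide and per-device read/write bandwidth once per
second, which makes it easy to spot one mirror half lagging behind the
other.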
Dan Naumov
2010-Jan-24 17:42 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
On Sun, Jan 24, 2010 at 7:05 PM, Jason Edwards <sub.mesa@gmail.com> wrote:
> Hi Dan,
>
> I read on the FreeBSD mailing list that you had some performance issues
> with ZFS. Perhaps I can help you with that.
>
> You seem to be running a single mirror, which means you won't have any
> speed benefit regarding writes, and usually RAID1 implementations offer
> little to no acceleration of read requests either; some even just read
> from the master disk and don't touch the 'slave' mirrored disk except
> when writing. ZFS is a lot more modern, however, although I did not test
> the performance of its mirror implementation.
>
> But benchmarking I/O can be tricky:
>
> 1) You use bonnie, but bonnie's tests are performed without a 'cooldown'
> period between the tests, meaning that when test 2 starts, data from
> test 1 is still being processed. For single disks and simple I/O this is
> not so bad, but for large write-back buffers and more complex I/O
> buffering, this may be inappropriate. I had patched bonnie some time in
> the past, but if you just want an MB/s number you can use dd for that.
>
> 2) The diskinfo tiny benchmark is single-queue only, I assume, meaning
> that it would not scale well, or at all, on RAID arrays. Actual
> filesystems on RAID arrays use multiple queues; meaning they would not
> read one sector at a time, but read 8 blocks (of 16KiB) "ahead"; this is
> called read-ahead and for traditional UFS filesystems it is controlled
> by the sysctl vfs.read_max variable. ZFS works differently, but you
> still need a "real" benchmark.
>
> 3) You need low-latency hardware; in particular, no PCI controller
> should be used. Only PCI Express based controllers or chipset-integrated
> Serial ATA controllers have proper performance. PCI can hurt performance
> very badly and has high interrupt CPU usage. Generally you should avoid
> PCI. PCI Express is fine though; it is a completely different interface
> that is in many ways the opposite of what PCI was.
>
> 4) Testing actual realistic I/O performance (in IOps) is very difficult,
> but testing sequential performance should be a lot easier. You may try
> using dd for this.
>
> For example, you can use dd on raw devices:
>
> dd if=/dev/ad4 of=/dev/null bs=1M count=1000
>
> I will explain each parameter:
>
> if=/dev/ad4 is the input file, the "read source".
>
> of=/dev/null is the output file, the "write destination". /dev/null
> means it just goes nowhere, so this is a read-only benchmark.
>
> bs=1M is the blocksize, how much data to transfer at a time. The default
> is 512 bytes or the sector size, but that's very slow. A value between
> 64KiB and 1024KiB is appropriate. bs=1M will select 1MiB, i.e. 1024KiB.
>
> count=1000 means transfer 1000 pieces, and with bs=1M that means
> 1000 * 1MiB = 1000MiB.
>
> This example was raw sequential reading from the start of the device
> /dev/ad4. If you want to test RAIDs, you need to work at the filesystem
> level. You can use dd for that too:
>
> dd if=/dev/zero of=/path/to/ZFS/mount/zerofile.000 bs=1M count=2000
>
> This command will read from /dev/zero (all zeroes) and write to a file
> on the ZFS-mounted filesystem; it will create the file "zerofile.000"
> and write 2000MiB of zeroes to that file. So this command tests write
> performance of the ZFS-mounted filesystem. To test read performance, you
> need to clear caches first by unmounting that filesystem and re-mounting
> it again. This frees up the memory holding parts of the filesystem as
> cache (reported in top as "Inact(ive)" instead of "Free").
>
> Please do make sure you double-check a dd command before running it, and
> run it as a normal user instead of root. A wrong dd command may write to
> the wrong destination and do things you don't want. The only real thing
> you need to check is the write destination (of=...). That's where dd is
> going to write to, so make sure it is the target you intended. A common
> mistake of my own was to write dd of=... if=... (starting with of
> instead of if) and thus doing the opposite of what I meant to do. This
> can be disastrous if you work with live data, so be careful! ;-)
>
> Hope any of this was helpful. During the dd benchmark, you can of course
> open a second SSH terminal and start "gstat" to see the devices' current
> I/O stats.
>
> Kind regards,
> Jason

Hi and thanks for your tips, I appreciate it :)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 36.206372 secs (29656156 bytes/sec)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 143.878615 secs (29851325 bytes/sec)

This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
4GB in 143.8 seconds / 28.4 MB/s in the second, which is roughly
consistent with the bonnie results. It also sadly seems to confirm the
very slow speed :( The disks are attached to a 4-port Sil3124
controller and again, my Windows benchmarks showing 65 MB/s+ were done
on the exact same machine, with the same disks attached to the same
controller. The only difference was that in Windows the disks weren't
in a mirror configuration but were tested individually. I do understand
that a mirror setup offers roughly the same write speed as an
individual disk, while the read speed usually varies from "equal to
individual disk speed" to "nearly the throughput of both disks
combined" depending on the implementation, but there is no obvious
reason I can see why my setup offers both read and write speeds roughly
1/3 to 1/2 of what the individual disks are capable of. Dmesg shows:

atapci0: <SiI 3124 SATA300 controller> port 0x1000-0x100f mem
0x90108000-0x9010807f,0x90100000-0x90107fff irq 21 at device 0.0 on pci4
ad8: 1907729MB <WDC WD20EADS-32R6B0 01.00A01> at ata4-master SATA300
ad10: 1907729MB <WDC WD20EADS-00R6B0 01.00A01> at ata5-master SATA300

I do recall also testing an alternative configuration in the past,
where I would boot off a UFS disk and have the ZFS mirror consist of
the 2 disks directly. The bonnie numbers in that case were in line with
my expectations; I was seeing 65-70 MB/s. Note: again, exact same
hardware, exact same disks attached to the exact same controller. To my
knowledge, Solaris/OpenSolaris has an issue where it has to
automatically disable the disk cache if ZFS is used on top of
partitions instead of raw disks, but also to my knowledge (I recall
reading this from multiple reputable sources) this issue does not
affect FreeBSD.

- Sincerely,
Dan Naumov
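One way to rule the write-cache question in or out on FreeBSD 8 with the
legacy ata(4) driver is to check the driver tunable and the drives'
capability pages. This is only a sketch; the device names ad8/ad10 come
from the dmesg output above, and the exact wording of the atacontrol
output may differ:

    sysctl hw.ata.wc                     # 1 = driver leaves ATA write caching enabled
    atacontrol cap ad8  | grep -i cache
    atacontrol cap ad10 | grep -i cache

If write caching turned out to be disabled on the partitioned setup,
that alone could explain a large sequential-write penalty.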
Dan Naumov
2010-Jan-24 18:30 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
On Sun, Jan 24, 2010 at 8:12 PM, Bob Friesenhahn
<bfriesen@simple.dallas.tx.us> wrote:
> On Sun, 24 Jan 2010, Dan Naumov wrote:
>>
>> This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
>> 4GB in 143.8 seconds / 28.4 MB/s, which is roughly consistent with the
>> bonnie results. It also sadly seems to confirm the very slow speed :(
>> The disks are attached to a 4-port Sil3124 controller and again, my
>> Windows benchmarks showing 65 MB/s+ were done on the exact same machine,
>> with the same disks attached to the same controller. The only difference
>> was that in Windows the disks weren't in a mirror configuration but were
>> tested individually. I do understand that a mirror setup offers
>> roughly the same write speed as an individual disk, while the read speed
>> usually varies from "equal to individual disk speed" to "nearly the
>> throughput of both disks combined" depending on the implementation,
>> but there is no obvious reason I can see why my setup offers both
>> read and write speeds roughly 1/3 to 1/2 of what the individual disks
>> are capable of. Dmesg shows:
>
> There is a misstatement in the above in that a "mirror setup offers
> roughly the same write speed as individual disk". It is possible for a
> mirror setup to offer a similar write speed to an individual disk, but it
> is also quite possible to get 1/2 (or even 1/3) the speed. A ZFS write to
> a mirror pair requires two independent writes. If these writes go down
> independent I/O paths, then there is hardly any overhead from the 2nd
> write. If the writes go through a bandwidth-limited shared path then they
> will contend for that bandwidth and you will see much less write
> performance.
>
> As a simple test, you can temporarily remove the mirror device from the
> pool and see if the write performance dramatically improves. Before doing
> that, it is useful to see the output of 'iostat -x 30' while under heavy
> write load to see if one device shows a much higher svc_t value than the
> other.

Ow, ow, WHOA:

atombsd# zpool offline tank ad8s1a

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 16.826016 secs (63814382 bytes/sec)

Offlining one half of the mirror bumps dd write speed from 28 MB/s to
64 MB/s! Let's see how the bonnie results change:

Mirror with both halves attached:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 18235 46.7 23137 19.9 13927 13.6 24818 49.3 44919 17.3 134.3  2.1

Mirror with one half offline:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         1024 22888 58.0 41832 35.1 22764 22.0 26775 52.3 54233 18.3 166.0  1.6

OK, the bonnie results have improved, but only very little.

- Sincerely,
Dan Naumov
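A note for anyone repeating this experiment: the pool runs without
redundancy while one half is offline, so the test should be followed by
bringing the device back and letting it resilver. A sketch, using the
device name from the post above:

    zpool online tank ad8s1a
    zpool status tank    # wait for the resilver to finish before trusting the mirror again

Bob's suggestion of 'iostat -x 30' is also easy to run in a second
terminal during the dd; a consistently higher svc_t or %b on one of
ad8/ad10 would point at a single slow disk rather than a shared-bandwidth
bottleneck.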
Dan Naumov
2010-Jan-24 18:40 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
On Sun, Jan 24, 2010 at 8:34 PM, Jason Edwards <sub.mesa@gmail.com> wrote:
>> A ZFS write to a mirror pair requires two independent writes. If these
>> writes go down independent I/O paths, then there is hardly any overhead
>> from the 2nd write. If the writes go through a bandwidth-limited shared
>> path then they will contend for that bandwidth and you will see much
>> less write performance.
>
> What he said may confirm my suspicion about PCI. So if you could try the
> same with "real" Serial ATA via the chipset or a PCI-e controller, you
> could confirm this story. I would be very interested. :P
>
> Kind regards,
> Jason

This wouldn't explain why a ZFS mirror on the 2 disks directly, on the
exact same controller (with the OS running off a separate disk),
results in "expected" performance, while having the OS run off a ZFS
mirror on top of MBR-partitioned disks, on the same controller, results
in very low speed.

- Dan
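A quick way to see whether both mirror halves are actually being written
at the same rate in the slow configuration is the gstat tool Jason
mentioned earlier, run alongside the dd write test. A sketch, assuming
the device names from the dmesg output:

    gstat -f '^ad(8|10)'

gstat refreshes every second and shows per-provider ops/s, KB/s and
%busy; two healthy mirror members should show nearly identical write
rates during a sequential write, so a large imbalance would point at one
disk or one path being the bottleneck.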
Alexander Motin
2010-Jan-24 21:54 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
Dan Naumov wrote:
> This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
> 4GB in 143.8 seconds / 28.4 MB/s, which is roughly consistent with the
> bonnie results. It also sadly seems to confirm the very slow speed :(
> The disks are attached to a 4-port Sil3124 controller and again, my
> Windows benchmarks showing 65 MB/s+ were done on the exact same machine,
> with the same disks attached to the same controller. The only difference
> was that in Windows the disks weren't in a mirror configuration but were
> tested individually. I do understand that a mirror setup offers
> roughly the same write speed as an individual disk, while the read speed
> usually varies from "equal to individual disk speed" to "nearly the
> throughput of both disks combined" depending on the implementation,
> but there is no obvious reason I can see why my setup offers both
> read and write speeds roughly 1/3 to 1/2 of what the individual disks
> are capable of. Dmesg shows:
>
> atapci0: <SiI 3124 SATA300 controller> port 0x1000-0x100f mem
> 0x90108000-0x9010807f,0x90100000-0x90107fff irq 21 at device 0.0 on
> pci4
> ad8: 1907729MB <WDC WD20EADS-32R6B0 01.00A01> at ata4-master SATA300
> ad10: 1907729MB <WDC WD20EADS-00R6B0 01.00A01> at ata5-master SATA300

8.0-RELEASE, and especially 8-STABLE, provide an alternative, much more
functional driver for this controller, named siis(4). If your SiI3124
card is installed in a proper bus (PCI-X or PCIe x4/x8), it can be
really fast (up to 1GB/s was measured).

-- 
Alexander Motin
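For anyone who wants to try the siis(4) route Alexander describes, a
minimal sketch of enabling it on 8.0 as a loadable module (note that the
disks then attach through CAM and show up as adaX instead of adX, so the
device names under the pool change):

    # /boot/loader.conf
    siis_load="YES"

After a reboot the SiI3124 channels should attach via siis(4), and
'zpool import' normally finds the pool again by its on-disk labels
despite the renaming; still, having a current backup before switching
drivers on a root pool is prudent.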
James R. Van Artsdalen
2010-Feb-03 10:51 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
Dan Naumov wrote:
> [jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test2 bs=1M count=4096
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes transferred in 143.878615 secs (29851325 bytes/sec)
>
> This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
> 4GB in 143.8 seconds / 28.4 MB/s

For the record, better results can be seen. In my test I put 3 Seagate
Barracuda XT drives in a port multiplier and connected that to one port
of a PCIe 3124 card. The MIRROR case is at about the I/O bandwidth
limit of those drives.

[root@kraken ~]# zpool create tmpx ada{2,3,4}
[root@kraken ~]# dd if=/dev/zero of=/tmpx/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 20.892818 secs (205571470 bytes/sec)

[root@kraken ~]# zpool destroy tmpx
[root@kraken ~]# zpool create tmpx mirror ada{2,3}
[root@kraken ~]# dd if=/dev/zero of=/tmpx/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 36.432818 secs (117887321 bytes/sec)
[root@kraken ~]#
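These numbers are write-only; to get a read figure that is not served
from the ARC, one approach (a sketch reusing the pool and file names
from the session above) is to export and re-import the pool before
reading the file back:

    zpool export tmpx
    zpool import tmpx
    dd if=/tmpx/test2 of=/dev/null bs=1M

Exporting drops the cached copy of test2, so the subsequent dd read has
to come from the disks.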