Dan Naumov
2010-Jan-24 16:36 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
Note: Since my issue is slow performance right off the bat and not
performance degradation over time, I decided to start a separate
discussion. After installing a fresh pure ZFS 8.0 system and building
all my ports, I decided to do some benchmarking. At this point, about
a dozen ports have been built and installed and the system has been up
for about 11 hours. No heavy background services have been running,
only SSHD and NTPD:

==============================================================
bonnie -s 8192:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 23821 61.7 22311 19.2 13928 13.7 25029 49.6 44806 17.2 135.0  3.1

During the process, top looks like this:

last pid: 83554;  load averages:  0.31,  0.31,  0.37   up 0+10:59:01  17:24:19
33 processes:  2 running, 31 sleeping
CPU:  0.1% user,  0.0% nice, 14.1% system,  0.7% interrupt, 85.2% idle
Mem: 45M Active, 4188K Inact, 568M Wired, 144K Cache, 1345M Free
Swap: 3072M Total, 3072M Free

Oh wow, that looks low. Alright, let's run it again, just to be sure:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 18235 46.7 23137 19.9 13927 13.6 24818 49.3 44919 17.3 134.3  2.1

OK, let's reboot the machine and see what kind of numbers we get on a fresh boot:

==============================================================

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 21041 53.5 22644 19.4 13724 12.8 25321 48.5 43110 14.0 143.2  3.3

Nope, no help from the reboot, still very low speed. Here is my pool:

==============================================================
zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            ad10s1a  ONLINE       0     0     0
            ad8s1a   ONLINE       0     0     0

==============================================================
diskinfo -c -t /dev/ad10
/dev/ad10
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        3876021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCAVY0301430 # Disk ident.

I/O command overhead:
        time to read 10MB block      0.164315 sec  =  0.008 msec/sector
        time to read 20480 sectors   3.030396 sec  =  0.148 msec/sector
        calculated command overhead                =  0.140 msec/sector

Seek times:
        Full stroke:      250 iter in   7.309334 sec =   29.237 msec
        Half stroke:      250 iter in   5.156117 sec =   20.624 msec
        Quarter stroke:   500 iter in   8.147588 sec =   16.295 msec
        Short forward:    400 iter in   2.544309 sec =    6.361 msec
        Short backward:   400 iter in   2.007679 sec =    5.019 msec
        Seq outer:       2048 iter in   0.392994 sec =    0.192 msec
        Seq inner:       2048 iter in   0.332582 sec =    0.162 msec

Transfer rates:
        outside:       102400 kbytes in   1.576734 sec =    64944 kbytes/sec
        middle:        102400 kbytes in   1.381803 sec =    74106 kbytes/sec
        inside:        102400 kbytes in   2.145432 sec =    47729 kbytes/sec

==============================================================
diskinfo -c -t /dev/ad8
/dev/ad8
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        3876021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        WD-WCAVY1611513 # Disk ident.

I/O command overhead:
        time to read 10MB block      0.176820 sec  =  0.009 msec/sector
        time to read 20480 sectors   2.966564 sec  =  0.145 msec/sector
        calculated command overhead                =  0.136 msec/sector

Seek times:
        Full stroke:      250 iter in   7.993339 sec =   31.973 msec
        Half stroke:      250 iter in   5.944923 sec =   23.780 msec
        Quarter stroke:   500 iter in   9.744406 sec =   19.489 msec
        Short forward:    400 iter in   2.511171 sec =    6.278 msec
        Short backward:   400 iter in   2.233714 sec =    5.584 msec
        Seq outer:       2048 iter in   0.427523 sec =    0.209 msec
        Seq inner:       2048 iter in   0.341185 sec =    0.167 msec

Transfer rates:
        outside:       102400 kbytes in   1.516305 sec =    67533 kbytes/sec
        middle:        102400 kbytes in   1.351877 sec =    75747 kbytes/sec
        inside:        102400 kbytes in   2.090069 sec =    48994 kbytes/sec

==============================================================

The exact same disks, on the exact same machine, are well capable of
65+ MB/s throughput (tested with ATTO multiple times) with different
block sizes using Windows 2008 Server and NTFS. So what would be the
cause of these very low bonnie result numbers in my case? Should I try
some other benchmark, and if so, with what parameters?

- Sincerely,
Dan Naumov
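For anyone reproducing this kind of test, ZFS's own per-pool statistics
are an easy way to watch what the mirror members are doing while bonnie
runs. A minimal sketch, assuming the pool name tank from the zpool
status output above:

    zpool iostat -v tank 1

This prints pool-wide and per-device read/write bandwidth once per
second, which makes it easy to spot one mirror half lagging behind the
other.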
Dan Naumov
2010-Jan-24 17:42 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
On Sun, Jan 24, 2010 at 7:05 PM, Jason Edwards <sub.mesa@gmail.com> wrote:
> Hi Dan,
>
> I read on the FreeBSD mailing list that you had some performance issues
> with ZFS. Perhaps I can help you with that.
>
> You seem to be running a single mirror, which means you won't have any
> speed benefit regarding writes, and usually RAID1 implementations offer
> little to no acceleration of read requests either; some even just read
> from the master disk and don't touch the 'slave' mirrored disk except
> when writing. ZFS is a lot more modern, however, although I did not test
> the performance of its mirror implementation.
>
> But benchmarking I/O can be tricky:
>
> 1) You use bonnie, but bonnie's tests are performed without a 'cooldown'
> period between the tests, meaning that when test 2 starts, data from
> test 1 is still being processed. For single disks and simple I/O this is
> not so bad, but for large write-back buffers and more complex I/O
> buffering, this may be inappropriate. I had patched bonnie some time in
> the past, but if you just want an MB/s number you can use dd for that.
>
> 2) The diskinfo tiny benchmark is single-queue only, I assume, meaning
> that it would not scale well, or at all, on RAID arrays. Actual
> filesystems on RAID arrays use multiple queues; meaning they would not
> read one sector at a time, but read 8 blocks (of 16KiB) "ahead"; this is
> called read-ahead and for traditional UFS filesystems it is controlled
> by the sysctl vfs.read_max variable. ZFS works differently, but you
> still need a "real" benchmark.
>
> 3) You need low-latency hardware; in particular, no PCI controller
> should be used. Only PCI Express based controllers or chipset-integrated
> Serial ATA controllers have proper performance. PCI can hurt performance
> very badly and has high interrupt CPU usage. Generally you should avoid
> PCI. PCI Express is fine though; it is a completely different interface
> that is in many ways the opposite of what PCI was.
>
> 4) Testing actual realistic I/O performance (in IOps) is very difficult,
> but testing sequential performance should be a lot easier. You may try
> using dd for this.
>
> For example, you can use dd on raw devices:
>
> dd if=/dev/ad4 of=/dev/null bs=1M count=1000
>
> I will explain each parameter:
>
> if=/dev/ad4 is the input file, the "read source".
>
> of=/dev/null is the output file, the "write destination". /dev/null
> means it just goes nowhere, so this is a read-only benchmark.
>
> bs=1M is the blocksize, how much data to transfer at a time. The default
> is 512 bytes or the sector size, but that's very slow. A value between
> 64KiB and 1024KiB is appropriate. bs=1M will select 1MiB, i.e. 1024KiB.
>
> count=1000 means transfer 1000 pieces, and with bs=1M that means
> 1000 * 1MiB = 1000MiB.
>
> This example was raw sequential reading from the start of the device
> /dev/ad4. If you want to test RAIDs, you need to work at the filesystem
> level. You can use dd for that too:
>
> dd if=/dev/zero of=/path/to/ZFS/mount/zerofile.000 bs=1M count=2000
>
> This command will read from /dev/zero (all zeroes) and write to a file
> on the ZFS-mounted filesystem; it will create the file "zerofile.000"
> and write 2000MiB of zeroes to that file. So this command tests write
> performance of the ZFS-mounted filesystem. To test read performance, you
> need to clear caches first by unmounting that filesystem and re-mounting
> it again. This frees up the memory holding parts of the filesystem as
> cache (reported in top as "Inact(ive)" instead of "Free").
>
> Please do make sure you double-check a dd command before running it, and
> run it as a normal user instead of root. A wrong dd command may write to
> the wrong destination and do things you don't want. The only real thing
> you need to check is the write destination (of=...). That's where dd is
> going to write to, so make sure it is the target you intended. A common
> mistake of my own was to write dd of=... if=... (starting with of
> instead of if) and thus doing the opposite of what I meant to do. This
> can be disastrous if you work with live data, so be careful! ;-)
>
> Hope any of this was helpful. During the dd benchmark, you can of course
> open a second SSH terminal and start "gstat" to see the devices' current
> I/O stats.
>
> Kind regards,
> Jason

Hi and thanks for your tips, I appreciate it :)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 36.206372 secs (29656156 bytes/sec)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 143.878615 secs (29851325 bytes/sec)

This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
4GB in 143.8 seconds / 28.4 MB/s in the second, which is roughly
consistent with the bonnie results. It also sadly seems to confirm the
very slow speed :( The disks are attached to a 4-port Sil3124
controller and again, my Windows benchmarks showing 65 MB/s+ were done
on the exact same machine, with the same disks attached to the same
controller. The only difference was that in Windows the disks weren't
in a mirror configuration but were tested individually. I do understand
that a mirror setup offers roughly the same write speed as an
individual disk, while the read speed usually varies from "equal to
individual disk speed" to "nearly the throughput of both disks
combined" depending on the implementation, but there is no obvious
reason I can see why my setup offers both read and write speeds roughly
1/3 to 1/2 of what the individual disks are capable of. Dmesg shows:

atapci0: <SiI 3124 SATA300 controller> port 0x1000-0x100f mem
0x90108000-0x9010807f,0x90100000-0x90107fff irq 21 at device 0.0 on pci4
ad8: 1907729MB <WDC WD20EADS-32R6B0 01.00A01> at ata4-master SATA300
ad10: 1907729MB <WDC WD20EADS-00R6B0 01.00A01> at ata5-master SATA300

I do recall also testing an alternative configuration in the past,
where I would boot off a UFS disk and have the ZFS mirror consist of
the 2 disks directly. The bonnie numbers in that case were in line with
my expectations; I was seeing 65-70 MB/s. Note: again, exact same
hardware, exact same disks attached to the exact same controller. To my
knowledge, Solaris/OpenSolaris has an issue where it has to
automatically disable the disk cache if ZFS is used on top of
partitions instead of raw disks, but also to my knowledge (I recall
reading this from multiple reputable sources) this issue does not
affect FreeBSD.

- Sincerely,
Dan Naumov
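One way to rule the write-cache question in or out on FreeBSD 8 with the
legacy ata(4) driver is to check the driver tunable and the drives'
capability pages. This is only a sketch; the device names ad8/ad10 come
from the dmesg output above, and the exact wording of the atacontrol
output may differ:

    sysctl hw.ata.wc                     # 1 = driver leaves ATA write caching enabled
    atacontrol cap ad8  | grep -i cache
    atacontrol cap ad10 | grep -i cache

If write caching turned out to be disabled on the partitioned setup,
that alone could explain a large sequential-write penalty.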
Dan Naumov
2010-Jan-24 18:30 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
On Sun, Jan 24, 2010 at 8:12 PM, Bob Friesenhahn
<bfriesen@simple.dallas.tx.us> wrote:
> On Sun, 24 Jan 2010, Dan Naumov wrote:
>>
>> This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
>> 4GB in 143.8 seconds / 28.4 MB/s, which is roughly consistent with the
>> bonnie results. It also sadly seems to confirm the very slow speed :(
>> The disks are attached to a 4-port Sil3124 controller and again, my
>> Windows benchmarks showing 65 MB/s+ were done on the exact same machine,
>> with the same disks attached to the same controller. The only difference
>> was that in Windows the disks weren't in a mirror configuration but were
>> tested individually. I do understand that a mirror setup offers
>> roughly the same write speed as an individual disk, while the read speed
>> usually varies from "equal to individual disk speed" to "nearly the
>> throughput of both disks combined" depending on the implementation,
>> but there is no obvious reason I can see why my setup offers both
>> read and write speeds roughly 1/3 to 1/2 of what the individual disks
>> are capable of. Dmesg shows:
>
> There is a misstatement in the above in that a "mirror setup offers
> roughly the same write speed as individual disk". It is possible for a
> mirror setup to offer a similar write speed to an individual disk, but it
> is also quite possible to get 1/2 (or even 1/3) the speed. A ZFS write to
> a mirror pair requires two independent writes. If these writes go down
> independent I/O paths, then there is hardly any overhead from the 2nd
> write. If the writes go through a bandwidth-limited shared path then they
> will contend for that bandwidth and you will see much less write
> performance.
>
> As a simple test, you can temporarily remove the mirror device from the
> pool and see if the write performance dramatically improves. Before doing
> that, it is useful to see the output of 'iostat -x 30' while under heavy
> write load to see if one device shows a much higher svc_t value than the
> other.

Ow, ow, WHOA:

atombsd# zpool offline tank ad8s1a

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test3 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 16.826016 secs (63814382 bytes/sec)

Offlining one half of the mirror bumps dd write speed from 28 MB/s to
64 MB/s! Let's see how the bonnie results change:

Mirror with both halves attached:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         8192 18235 46.7 23137 19.9 13927 13.6 24818 49.3 44919 17.3 134.3  2.1

Mirror with one half offline:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         1024 22888 58.0 41832 35.1 22764 22.0 26775 52.3 54233 18.3 166.0  1.6

OK, the bonnie results have improved, but only very little.

- Sincerely,
Dan Naumov
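A note for anyone repeating this experiment: the pool runs without
redundancy while one half is offline, so the test should be followed by
bringing the device back and letting it resilver. A sketch, using the
device name from the post above:

    zpool online tank ad8s1a
    zpool status tank    # wait for the resilver to finish before trusting the mirror again

Bob's suggestion of 'iostat -x 30' is also easy to run in a second
terminal during the dd; a consistently higher svc_t or %b on one of
ad8/ad10 would point at a single slow disk rather than a shared-bandwidth
bottleneck.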
Dan Naumov
2010-Jan-24 18:40 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
On Sun, Jan 24, 2010 at 8:34 PM, Jason Edwards <sub.mesa@gmail.com> wrote:
>> A ZFS write to a mirror pair requires two independent writes. If these
>> writes go down independent I/O paths, then there is hardly any overhead
>> from the 2nd write. If the writes go through a bandwidth-limited shared
>> path then they will contend for that bandwidth and you will see much
>> less write performance.
>
> What he said may confirm my suspicion about PCI. So if you could try the
> same with "real" Serial ATA via the chipset or a PCI-e controller, you
> could confirm this story. I would be very interested. :P
>
> Kind regards,
> Jason

This wouldn't explain why a ZFS mirror on the 2 disks directly, on the
exact same controller (with the OS running off a separate disk),
results in "expected" performance, while having the OS run off a ZFS
mirror on top of MBR-partitioned disks, on the same controller, results
in very low speed.

- Dan
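A quick way to see whether both mirror halves are actually being written
at the same rate in the slow configuration is the gstat tool Jason
mentioned earlier, run alongside the dd write test. A sketch, assuming
the device names from the dmesg output:

    gstat -f '^ad(8|10)'

gstat refreshes every second and shows per-provider ops/s, KB/s and
%busy; two healthy mirror members should show nearly identical write
rates during a sequential write, so a large imbalance would point at one
disk or one path being the bottleneck.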
Alexander Motin
2010-Jan-24 21:54 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
Dan Naumov wrote:
> This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
> 4GB in 143.8 seconds / 28.4 MB/s, which is roughly consistent with the
> bonnie results. It also sadly seems to confirm the very slow speed :(
> The disks are attached to a 4-port Sil3124 controller and again, my
> Windows benchmarks showing 65 MB/s+ were done on the exact same machine,
> with the same disks attached to the same controller. The only difference
> was that in Windows the disks weren't in a mirror configuration but were
> tested individually. I do understand that a mirror setup offers
> roughly the same write speed as an individual disk, while the read speed
> usually varies from "equal to individual disk speed" to "nearly the
> throughput of both disks combined" depending on the implementation,
> but there is no obvious reason I can see why my setup offers both
> read and write speeds roughly 1/3 to 1/2 of what the individual disks
> are capable of. Dmesg shows:
>
> atapci0: <SiI 3124 SATA300 controller> port 0x1000-0x100f mem
> 0x90108000-0x9010807f,0x90100000-0x90107fff irq 21 at device 0.0 on
> pci4
> ad8: 1907729MB <WDC WD20EADS-32R6B0 01.00A01> at ata4-master SATA300
> ad10: 1907729MB <WDC WD20EADS-00R6B0 01.00A01> at ata5-master SATA300

8.0-RELEASE, and especially 8-STABLE, provide an alternative, much more
functional driver for this controller, named siis(4). If your SiI3124
card is installed in a proper bus (PCI-X or PCIe x4/x8), it can be
really fast (up to 1GB/s was measured).

-- 
Alexander Motin
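For anyone who wants to try the siis(4) route Alexander describes, a
minimal sketch of enabling it on 8.0 as a loadable module (note that the
disks then attach through CAM and show up as adaX instead of adX, so the
device names under the pool change):

    # /boot/loader.conf
    siis_load="YES"

After a reboot the SiI3124 channels should attach via siis(4), and
'zpool import' normally finds the pool again by its on-disk labels
despite the renaming; still, having a current backup before switching
drivers on a root pool is prudent.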
James R. Van Artsdalen
2010-Feb-03 10:51 UTC
8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance
Dan Naumov wrote:
> [jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test2 bs=1M count=4096
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes transferred in 143.878615 secs (29851325 bytes/sec)
>
> This works out to 1GB in 36.2 seconds / 28.2 MB/s in the first test and
> 4GB in 143.8 seconds / 28.4 MB/s

For the record, better results can be seen. In my test I put 3 Seagate
Barracuda XT drives in a port multiplier and connected that to one port
of a PCIe 3124 card. The MIRROR case is at about the I/O bandwidth
limit of those drives.

[root@kraken ~]# zpool create tmpx ada{2,3,4}
[root@kraken ~]# dd if=/dev/zero of=/tmpx/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 20.892818 secs (205571470 bytes/sec)

[root@kraken ~]# zpool destroy tmpx
[root@kraken ~]# zpool create tmpx mirror ada{2,3}
[root@kraken ~]# dd if=/dev/zero of=/tmpx/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 36.432818 secs (117887321 bytes/sec)
[root@kraken ~]#
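These numbers are write-only; to get a read figure that is not served
from the ARC, one approach (a sketch reusing the pool and file names
from the session above) is to export and re-import the pool before
reading the file back:

    zpool export tmpx
    zpool import tmpx
    dd if=/tmpx/test2 of=/dev/null bs=1M

Exporting drops the cached copy of test2, so the subsequent dd read has
to come from the disks.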