Guys,

Have an OpenSolaris x86 box running:

SunOS thsudfile01 5.11 snv_111b i86pc i386 i86pc Solaris

This has 2 old qla2200 1Gbit FC cards attached. Each bus is connected to an old transtec F/C RAID array. This has a couple of large LUNs that form a single large zpool:

root at thsudfile01:~# zpool status bucket
  pool: bucket
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bucket      ONLINE       0     0     0
          c5t0d0    ONLINE       0     0     0
          c8t3d0    ONLINE       0     0     0

errors: No known data errors

root at thsudfile01:~# zfs list bucket
NAME     USED  AVAIL  REFER  MOUNTPOINT
bucket  2.69T  5.31T    22K  /bucket

This is being used as an iSCSI target for an ESX 4.0 development environment. I found the performance to be really poor, and the culprit seems to be write performance to the raw zvol. For example, on this zfs filesystem allocated as a volume:

root at thsudfile01:~# zfs list bucket/iSCSI/lun1
NAME               USED  AVAIL  REFER  MOUNTPOINT
bucket/iSCSI/lun1  250G  5.55T  3.64G  -

root at thsudfile01:~# dd if=/dev/zero of=/dev/zvol/rdsk/bucket/iSCSI/lun1 bs=65536 count=102400
^C7729+0 records in
7729+0 records out
506527744 bytes (507 MB) copied, 241.707 s, 2.1 MB/s

Some zpool iostat 1 1000:

bucket      2.44T  5.68T      0    203      0  2.73M
bucket      2.44T  5.68T      0    216      0  2.83M
bucket      2.44T  5.68T      0    120  63.4K  1.58M
bucket      2.44T  5.68T      2    350   190K  16.9M
bucket      2.44T  5.68T      0    123      0  1.64M
bucket      2.44T  5.68T      0    230      0  3.02M

Read performance from that zvol (assuming /dev/null behaves properly) is fine:

root at thsudfile01:/bucket/transtec# dd of=/dev/null if=/dev/zvol/rdsk/bucket/iSCSI/lun1 bs=65536 count=204800
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 47.0256 s, 285 MB/s

Somewhat optimistic, that... but iostat shows 100MB/s-ish.

Writing to a zfs filesystem from that zpool is also fine, here with a write big enough to exhaust the machine's 12GB of memory:

root at thsudfile01:/bucket/transtec# dd if=/dev/zero of=FILE bs=65536 count=409600
^C
336645+0 records in
336645+0 records out
22062366720 bytes (22 GB) copied, 176.369 s, 125 MB/s

and bursts of cache flush from iostat:

bucket      2.44T  5.68T      0    342      0  38.7M
bucket      2.44T  5.68T      0  1.47K      0   188M
bucket      2.44T  5.68T      0    240      0  21.3M
bucket      2.44T  5.68T      0  1.54K      0   191M
bucket      2.44T  5.68T      0  1.49K      0   191M
bucket      2.44T  5.68T      0    434      0  44.2M

So we seem to be able to get data down to disk via the cache at a reasonable rate, and read from a raw zvol OK, but writes are horribly slow. Am I missing something obvious? Let me know what info would be diagnostic and I'll post it...

Cheers,
Leon
--
This message posted from opensolaris.org
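One quick way to gather some of that diagnostic info is to watch per-device service times while the slow dd is running (a rough sketch, standard Solaris iostat flags; the interesting devices are the two LUNs from the zpool status above):

# iostat -xnz 1

High asvc_t or %b on c5t0d0 / c8t3d0 during the slow zvol writes would point at the arrays or the FC path; low, mostly idle devices would point back at how the writes reach ZFS.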
Do you have any older benchmarks on these cards and arrays (from their pre-ZFS life)? Perhaps this is not a ZFS regression but a hardware config issue? Perhaps some caching (like a per-disk write cache) is not enabled on the arrays? As you may know, the ability (and reliability) of such a cache often depends on the presence of battery backup (BBU) modules on the array controller.

It is also often pointed out that ZFS prefers raw disks to hardware vendors' RAID arrays (with errors possibly creeping into the data, undetected by the firmware's algorithms unless a whole disk breaks, or - in the more elaborate conspiracy theories - errors caused by bugs in the firmware itself). Because of this, if external RAID arrays are used as vdevs, it is often recommended to combine them into a redundant ZFS pool (such as a mirror of two equivalent arrays) so that ZFS can detect and repair such errors itself.

//Jim
--
This message posted from opensolaris.org
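In the absence of old numbers, a rough baseline for the FC path and arrays could be a raw read straight off one of the LUNs (read-only, so it does not touch the pool's data; the device name is only a guess based on the zpool status above, so double-check it first):

# dd if=/dev/rdsk/c5t0d0s0 of=/dev/null bs=1048576 count=1024

If that also crawls, the cards or arrays are suspect; if it runs near wire speed, the problem is more likely in how the zvol writes are handled. A comparable write baseline would need a scratch LUN outside the pool, since writing to these raw devices would corrupt it.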
Hmm, scratch that. Maybe. I did not at first get the point that your writes to a filesystem dataset work quickly. Perhaps the filesystem is (better) cached indeed, i.e. *maybe* zvol writes are synchronous while zfs filesystem writes are cached and thus async? Try playing around with the relevant dataset attributes (one way to inspect them is sketched further below)...

I'm running a test on my system (a snv_114 Thumper, 16Gb RAM, used for other purposes as well); the CPU is mostly idle now (2.5-3.2% kernel time, that's about it). It seems I have results not unlike yours. Not cool, because I wanted to play with COMSTAR iSCSI - and I'm not sure it will perform well ;)

I'm dd'ing 30Gb to an uncompressed test zvol with the same 64kb block size (maybe it is too small?), and zpool iostat goes like this - a hundred IOs at 7MBps for a minute, then a burst of 100-170MBps and 20-25K IOps for a second:

pond        5.79T  4.41T      0    106      0  7.09M
pond        5.79T  4.41T      0  1.93K      0  20.7M
pond        5.79T  4.41T      0  13.3K      0   106M
pond        5.79T  4.41T      0    116      0  7.76M
pond        5.79T  4.41T      0    108      0  7.23M
pond        5.79T  4.41T      0    107      0  7.16M
pond        5.79T  4.41T      0    107      0  7.16M

or

pond        5.79T  4.41T      0    117      0  7.83M
pond        5.79T  4.41T      0  5.61K      0  49.7M
pond        5.79T  4.41T      0  19.0K    504   149M
pond        5.79T  4.41T      0    104      0  6.96M

Weird indeed. It wrote 10Gb (according to "zfs get usedbydataset pond/test") in roughly 30 minutes, after which I killed it.

Now, writing to an uncompressed filesystem dataset (although very far from what's trumpeted as Thumper performance) yields quite different numbers:

pond        5.80T  4.40T      1  3.64K   1022   457M
pond        5.80T  4.40T      0    866    967  75.7M
pond        5.80T  4.40T      0  4.65K      0   586M
pond        5.80T  4.40T      6    802  33.4K  69.2M
pond        5.80T  4.40T     29  2.44K  1.10M   301M
pond        5.80T  4.40T     32    691   735K  25.0M
pond        5.80T  4.40T     56  1.59K  2.29M   184M
pond        5.80T  4.40T    150    768  4.61M  10.5M
pond        5.80T  4.40T      2      0  25.5K      0
pond        5.80T  4.40T      0  2.75K      0   341M
pond        5.80T  4.40T      7  3.96K   339K   497M
pond        5.80T  4.39T     85    740  3.57M  59.0M
pond        5.80T  4.39T     67      0  2.22M      0
pond        5.80T  4.39T      9  4.67K   292K   581M
pond        5.80T  4.39T      4  1.07K   126K   137M
pond        5.80T  4.39T     27    333   338K  9.15M
pond        5.80T  4.39T      5      0  28.0K  3.99K
pond        5.82T  4.37T      1  5.42K  1.67K   677M
pond        5.83T  4.37T      3  1.69K  8.36K   173M
pond        5.83T  4.37T      2      0  5.49K      0
pond        5.83T  4.37T      0  6.32K      0   790M
pond        5.83T  4.37T      2    290  7.95K  27.8M
pond        5.83T  4.37T      0  9.64K  1.23K  1.18G

The numbers are jumpy (maybe due to fragmentation, other processes, etc.) but there are often spikes in excess of 500MBps. The whole test took relatively little time:

# time dd if=/dev/zero of=/pond/tmpnocompress/test30g bs=65536 count=500000
500000+0 records in
500000+0 records out

real    1m27.657s
user    0m0.302s
sys     0m46.976s

# du -hs /pond/tmpnocompress/test30g
 30G   /pond/tmpnocompress/test30g

To give some detail about the pool: it is on a Sun X4500 with 48 250Gb SATA drives. It was created as a 9x5 set (9 stripes made of 5-disk raidz1 vdevs) spread across different controllers, with the command:

# zpool create -f pond \
    raidz1 c0t0d0 c1t0d0 c4t0d0 c6t0d0 c7t0d0 \
    raidz1 c0t1d0 c1t2d0 c4t3d0 c6t5d0 c7t6d0 \
    raidz1 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
    raidz1 c0t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
    raidz1 c0t3d0 c1t3d0 c5t3d0 c6t3d0 c7t3d0 \
    raidz1 c0t4d0 c1t4d0 c4t4d0 c6t4d0 c7t4d0 \
    raidz1 c0t5d0 c1t5d0 c4t5d0 c5t5d0 c7t5d0 \
    raidz1 c0t6d0 c1t6d0 c4t6d0 c5t6d0 c6t6d0 \
    raidz1 c1t7d0 c4t7d0 c5t7d0 c6t7d0 c7t7d0 \
    spare c0t7d0

Alas, while there were many blogs, I couldn't find a definitive answer last year as to which Thumper layout is optimal in performance and/or reliability (with regard to its 6 controllers of 8 disks each, with 2 disks on one of the controllers reserved for booting).
As a result, we spread each raidz1 across 5 controllers, so the loss of one controller should, on average, have minimal impact in terms of data loss. Since the system layout is not symmetrical, some controllers are more important than others (say, the one with the boot disks).

//Jim
--
This message posted from opensolaris.org
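Coming back to "playing with the relevant dataset attributes": a minimal way to compare what the zvol and the filesystem dataset are actually set to (the property list here is just an example; "zfs get all" shows everything a given build supports):

# zfs get volblocksize,compression,checksum pond/test
# zfs get recordsize,compression,checksum pond/tmpnocompress

Any difference there, or in whatever extra caching-related properties your build exposes, would be a first hint as to why the two behave so differently.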
To tell the truth, I expected zvols to be faster than filesystem datasets. They seem to have less overhead, with no inodes, POSIX semantics, ACLs and so on. So I'm puzzled by the test results.

I'm now looking at the dd I/O block size, and it matters a lot indeed, especially compared to the zvol results with small blocks like 64k.

I ran a number of tests with a zvol recreated by these commands before each run (this may, however, cause varying fragmentation, affecting results across runs):

# zfs destroy -r pond/test; zfs create -V 30G pond/test; zfs set compression=off pond/test; sync; dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512; sync

and tests going like:

# time dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1024 bs=1048576
1024+0 records in
1024+0 records out

real    0m42.442s
user    0m0.006s
sys     0m4.292s

The test progress was quite jumpy (with "zpool iostat pond 1" values varying from 30 to 70 MBps, reads coming in sometimes). So I'd stick to the overall result - the rounded wallclock time it takes to write 1024 records of varying size, and the resulting average end-user MBps. I also note the "sys" time, since that's what is consumed by the kernel and the disk subsystem, after all. I don't quote zpool iostat speeds, since they vary too much and I don't bother with a spreadsheet right now; but the reported values stay about halfway between the "wallclock MBps" and "sys MBps" calculations on the perceived average, peaking at about 350MBps for large block sizes (>4MB).

1 MB (bs=1048576): 42s (4s), 24MBps
4 MB (bs=4194304): 42s (15s), 96MBps
16MB (bs=16777216): 129s-148s (62-64s), 127-110MBps
32MB (bs=33554432, 40Gb zvol): 303s (127s), 108MBps

The picture was similarly jumpy when writing a file to a filesystem dataset; "zpool iostat" values again jumped anywhere between single MBps and GBps. Simple cleanups were used, like:

# rm /pool/test30g; sync; time dd if=/dev/zero of=/pool/test30g count=1024 bs=33554432

The values remain somewhat consistent (in the same league, at least):

1 MB (bs=1048576, 10240 blocks): 20-21s (7-8s), 512-487MBps
1 MB (bs=1048576): 2.3s (0.6s), 445MBps
4 MB (bs=4194304): 8s (3s), 512MBps
16MB (bs=16777216): 37s (15s), 442MBps
32MB (bs=33554432): 74-103s (32-42s), 442-318MBps
64Kb (bs=65536, 545664 blocks): 94s (47s), 362MBps

All in all, to make the results more precise, these tests should be done in greater numbers and averaged. But here we have some figures to think about...

On a side note, I'll now pay more attention to tuning suggestions which involve multi-megabyte buffers for network sockets, etc. They can actually change performance many times over!

On another note, for some reason I occasionally got results like this:

write: File too large
1+0 records in
1+0 records out

I think the zvol was not yet fully created at that point. In about 10-15 sec I was able to commence the test run. Perhaps it helped that I "initialized" the zvol with a small write after creation, i.e.:

# dd if=/dev/zero of=/dev/zvol/rdsk/pond/test count=1000 bs=512

Strange...
--
This message posted from opensolaris.org
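If that "File too large" oddity really is the /dev/zvol node not being ready yet, a small guard before the timed run might sidestep it (a sketch only, using the same path as in the tests above):

# zfs create -V 30G pond/test
# until [ -c /dev/zvol/rdsk/pond/test ]; do sleep 1; done

That just polls until the character device node exists, instead of relying on a fixed 10-15 second pause or a throwaway initial write.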
On Jul 9, 2009, at 4:22 AM, Jim Klimov <no-reply at opensolaris.org> wrote:

> To tell the truth, I expected zvols to be faster than filesystem datasets.
> [...]
> I'm now looking at the dd I/O block size, and it matters a lot indeed,
> especially compared to the zvol results with small blocks like 64k.
> [...]
> All in all, to make the results more precise, these tests should be done
> in greater numbers and averaged. But here we have some figures to think about...

When running throughput tests, the block sizes to be concerned about are: 4k, 8k, 16k and 64k. These are the sizes that most filesystems and databases use. If you get 4k to perform well, chances are the others will fall into line.

-Ross
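Tying Ross's list of sizes to Jim's earlier point about running more samples and averaging, a rough harness might look like the sketch below (the zvol path and total size are only placeholders, reusing the rdsk path from the earlier tests; it keeps the total write at about 1GB per block size and repeats each run three times):

#!/bin/sh
# Rough throughput sweep over 4k/8k/16k/64k with 3 repeats per size,
# so the wallclock and sys times can be averaged afterwards.
ZVOL=/dev/zvol/rdsk/pond/test          # placeholder - adjust before running
TOTAL=1073741824                       # ~1GB written per run
for bs in 4096 8192 16384 65536; do
    count=`expr $TOTAL / $bs`
    run=1
    while [ $run -le 3 ]; do
        echo "bs=$bs run=$run"
        time dd if=/dev/zero of=$ZVOL bs=$bs count=$count
        run=`expr $run + 1`
    done
done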
After reading many, many threads on ZFS performance today (top of the list in the forum, and some chains of references), I applied a bit of tuning to the server. In particular, I've set zfs_write_limit_override to 384Mb so my cache is spooled to disk more frequently (when streaming lots of writes) and in smaller increments. In the live kernel:

# echo zfs_write_limit_override/W0t402653184 | mdb -kw

which corresponds to the /etc/system setting:

set zfs:zfs_write_limit_override = 0x18000000

The system seems to be working more smoothly (vs. jerky), and "zpool iostat" values are not quite as jumpy (i.e. 320MBps to 360MBps for a certain test). The results also seem faster and more consistent.

With this tuning applied, I'm writing to a 40G zvol, 1M records (count=1048576) of:

4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 54s (30s), 303MBps
32k (bs=32768): 113s (56s), 290MBps
64k (bs=65536): 269s (104s), 243MBps

And 10240 larger records of:

1 MB (bs=1048576): 33s (8s), 310MBps
2 MB (bs=2097152): 74s (23s), 276MBps

And 1024 yet larger records:

1 MB (bs=1048576): 4s (1s), 256MBps
4 MB (bs=4194304): 12s (5s), 341MBps
16MB (bs=16777216): 71s (18s), 230MBps
32MB (bs=33554432): 150s (36s), 218MBps

So the zvol picture is quite a bit better now (albeit not perfect - i.e. no values are near the 1GBps noted previously in "zpool iostat"), for both small and large blocks.

For the filesystem dataset the new values are very similar (like, to tenths of a second on the smaller block sizes!), but as the block size grows, filesystems start losing to the zvols. Overall the result seems lower than achieved before I tried tuning.

1M records (count=1048576) of:

4k (bs=4096): 17s (12s), 241MBps
8k (bs=8192): 29s (18s), 282MBps
16k (bs=16384): 67s (30s), 245MBps
32k (bs=32768): 144s (55s), 228MBps
64k (bs=65536): 275s (98s), 238MBps

And 10240 larger records go better:

1 MB (bs=1048576): 33s (9s), 310MBps
2 MB (bs=2097152): 70s (21s), 292MBps

And 1024 yet larger records:

1 MB (bs=1048576): 2.8s (0.8s), 366MBps
4 MB (bs=4194304): 12s (4s), 341MBps
16MB (bs=16777216): 55s (17s), 298MBps
32MB (bs=33554432): 140s (36s), 234MBps

Occasionally I did reruns; the wall-clock time for the same setup can vary significantly (like 65s vs 84s) while the system time stays pretty much the same. "zpool iostat" shows larger values (like 320MBps typically), but I think that can be attributed to writing parity stripes on the raidz vdevs.

//Jim

PS: for completeness, I'll try smaller blocks without tuning in a future post.
--
This message posted from opensolaris.org
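A couple of companion commands for anyone repeating this, using the same mdb mechanism as above (the read assumes the variable is 64-bit on this build; setting it back to 0 restores the default behaviour of sizing the write limit dynamically):

# echo zfs_write_limit_override/E | mdb -k          (show the current value)
# echo zfs_write_limit_override/W0t0 | mdb -kw      (revert to the default)

The mdb write only affects the running kernel; it is the /etc/system line that makes the change persist across reboots.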
Writes using the character interface (/dev/zvol/rdsk) are synchronous. If you want caching, you can go through the block interface (/dev/zvol/dsk) instead.

- Eric
--
This message posted from opensolaris.org
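As a quick check of Eric's point, the dd test from the original post could be repeated against the block interface of the same zvol (same path, just dsk instead of rdsk):

# dd if=/dev/zero of=/dev/zvol/dsk/bucket/iSCSI/lun1 bs=65536 count=102400

If the synchronous character device really is the culprit, this run should come out much closer to the filesystem-dataset speeds - though what the iSCSI target layer then does with the zvol is a separate question.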