Hmmm,

wondering about IMHO strange ZFS results ...

X4440: 4x6 2.8GHz cores (Opteron 8439 SE), 64 GB RAM
       6x Sun STK RAID INT V1.0 (Hitachi H103012SCSUN146G SAS)
       Nevada b124

Started with a simple test using zfs on c1t0d0s0: cd /var/tmp

(1) time sh -c 'mkfile 32g bla ; sync'
    0.16u 19.88s 5:04.15 6.5%
(2) time sh -c 'mkfile 32g blabla ; sync'
    0.13u 46.41s 5:22.65 14.4%
(3) time sh -c 'mkfile 32g blablabla ; sync'
    0.19u 26.88s 5:38.07 8.0%

chmod 644 b*
(4) time dd if=bla of=/dev/null bs=128k
    262144+0 records in
    262144+0 records out
    0.26u 25.34s 6:06.16 6.9%
(5) time dd if=blabla of=/dev/null bs=128k
    262144+0 records in
    262144+0 records out
    0.15u 26.67s 4:46.63 9.3%
(6) time dd if=blablabla of=/dev/null bs=128k
    262144+0 records in
    262144+0 records out
    0.10u 20.56s 0:20.68 99.9%

So 1-3 is more or less what I expected (~97..108 MB/s write).
However, 4-6 looks strange: 89, 114 and 1585 MB/s read!

Since the ARC size is ~55 +- 2 GB (at least arcstat.pl says so), I guess
(6) reads completely from memory. Hmm - maybe. However, I would expect
that when repeating 5 and 6, 'blablabla' gets evicted in favor of 'bla'
or 'blabla'. But the numbers say that 'blablabla' is kept in the cache,
since I get almost the same results as in the first run (and zpool
iostat/arcstat.pl show almost no activity at all for blablabla).
So is this a ZFS bug? Or is the OS doing some magic here?

2nd) I've never had a Sun STK RAID INT before. Actually my intention was
to create a zpool mirror of sd0 and sd1 for boot and logs, and a 2x2-way
zpool mirror with the 4 remaining disks. However, the controller does not
seem to support JBODs :( - which is also bad, since we can't simply move
those disks into another machine with a different controller without data
loss, because the controller seems to use its own on-disk format under
the hood. Also, the 256MB BBU cache seems to be a little small for the
ZIL, even if one knew how to configure it ...

So what would you recommend? Creating 2 appropriate STK INT arrays and
using both as single zpool devices, i.e. without ZFS mirror devs and 2nd
copies?

The intended workload is MySQL DBs + VBox images on the 4-disk mirror,
and logs and the OS on the 2-disk mirror; the machine should also act as
a Sun Ray server (user homes and additional apps come from another server
via NFS).

Any hints?

Regards,
jel.
-- 
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
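
[The MB/s figures quoted above are just 32 GiB divided by the wall-clock
times reported by time(1). As a quick sanity check of that arithmetic,
using the elapsed times from runs (1) and (6):]

  # 32 GiB = 32768 MiB; elapsed times taken from the 'time' output above
  echo "scale=1; 32768 / 304.15" | bc    # run (1): 5:04.15 -> ~107.7 MB/s
  echo "scale=1; 32768 / 20.68"  | bc    # run (6): 0:20.68 -> ~1584.5 MB/s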
jel+zfs at cs.uni-magdeburg.de said:
> 2nd) Never had a Sun STK RAID INT before. Actually my intention was to create
> a zpool mirror of sd0 and sd1 for boot and logs, and a 2x2-way zpool mirror
> with the 4 remaining disks. However, the controller seems not to support
> JBODs :( - which is also bad, since we can't simply put those disks into
> another machine with a different controller without data loss, because the
> controller seems to use its own format under the hood.

Yes, those Adaptec/STK internal RAID cards are annoying for use with ZFS.
You also cannot replace a failed disk without using the STK RAID software
to configure the new disk as a standalone volume (before "zpool replace").
Fortunately you probably don't need to boot into the BIOS-level utility;
I think you can use the Adaptec StorMan utilities from within the OS, if
you remembered to install them.

> Also the 256MB
> BBCache seems to be a little bit small for ZIL even if one would know, how to
> configure it ...

Unless you have an external (non-NV cached) pool on the same server, you
wouldn't gain anything from setting up a separate ZIL in this case. All
your internal drives have NV cache without doing anything special.

> So what would you recommend? Creating 2 appropriate STK INT arrays and using
> both as a single zpool device, i.e. without ZFS mirror devs and 2nd copies?

Here's what we did: Configure all internal disks as standalone volumes on
the RAID card. All those volumes have the battery-backed cache enabled.
The first two 146GB drives got sliced in two: the first half of each disk
became the boot/root mirror pool, and the 2nd half was used for a
separate-ZIL mirror, applied to an external SATA pool. Our remaining
internal drives were configured into a mirrored ZFS pool for database
transaction logs. No need for a separate ZIL there, since the internal
drives effectively have NV cache as far as ZFS is concerned.

Yes, the 256MB cache is small, but if it fills up, it is backed by the
10kRPM internal SAS drives, which should have decent latency when
compared to external SATA JBOD drives. And even this tiny NV cache makes
a huge difference when used on an NFS server:

    http://acc.ohsu.edu/~hakansom/j4400_bench.html

Regards,

Marion
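
[A rough sketch of that layout in zpool terms. The device, slice and pool
names below are made up for illustration (they are not from Marion's
actual setup), the slices would already have been laid out with
format(1M), and root pools are normally created by the installer, so only
the data-pool steps are shown:]

  # mirrored separate ZIL for an existing external SATA pool ("satapool"),
  # using the second slice of each of the two sliced internal drives
  zpool add satapool log mirror c1t0d0s1 c1t1d0s1

  # remaining four internal (NV-cached) drives as a mirrored pool
  # for database transaction logs
  zpool create dblogs mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0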
Jens Elkner wrote:
> Hmmm,
>
> wondering about IMHO strange ZFS results ...
>
> X4440: 4x6 2.8GHz cores (Opteron 8439 SE), 64 GB RAM
>        6x Sun STK RAID INT V1.0 (Hitachi H103012SCSUN146G SAS)
>        Nevada b124
>
> Started with a simple test using zfs on c1t0d0s0: cd /var/tmp
>
> (1) time sh -c 'mkfile 32g bla ; sync'
>     0.16u 19.88s 5:04.15 6.5%
> (2) time sh -c 'mkfile 32g blabla ; sync'
>     0.13u 46.41s 5:22.65 14.4%
> (3) time sh -c 'mkfile 32g blablabla ; sync'
>     0.19u 26.88s 5:38.07 8.0%
>
> chmod 644 b*
> (4) time dd if=bla of=/dev/null bs=128k
>     262144+0 records in
>     262144+0 records out
>     0.26u 25.34s 6:06.16 6.9%
> (5) time dd if=blabla of=/dev/null bs=128k
>     262144+0 records in
>     262144+0 records out
>     0.15u 26.67s 4:46.63 9.3%
> (6) time dd if=blablabla of=/dev/null bs=128k
>     262144+0 records in
>     262144+0 records out
>     0.10u 20.56s 0:20.68 99.9%
>
> So 1-3 is more or less expected (~97..108 MB/s write).
> However 4-6 looks strange: 89, 114 and 1585 MB/s read!
>
> Since the arc size is ~55+-2GB (at least arcstat.pl says so), I guess (6)
> reads from memory completely. Hmm - maybe.
> However, I would expect, that when repeating 5-6, 'blablabla' gets replaced
> by 'bla' or 'blabla'. But the numbers say, that 'blablabla' is kept in the
> cache, since I get almost the same results as in the first run (and zpool
> iostat/arcstat.pl show for the blablabla almost no activity at all).
> So is this a ZFS bug? Or does the OS some magic here?
>

IIRC, when ZFS detects sequential reads of a given file it stops caching
blocks for it. Because #6 was created last, all of its blocks were already
cached in the ARC; then, when reading #4 and #5, ZFS detected the
sequential reads and did not put that data into the cache, leaving the
last-written file entirely cached. For many workloads this is desirable
behaviour, but for many others it is not (like parsing large log files
with grep-like tools, which then never get cached ...).

-- 
Robert Milkowski
http://milek.blogspot.com
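
[One way to see that behaviour directly is to watch physical pool I/O and
the raw ARC counters while repeating one of the reads. The pool name
"rpool" and the file path below are placeholders; the tools are the same
ones already used in the thread (zpool iostat, arcstat.pl and the kstats
it reads):]

  # terminal 1: physical reads from the pool, 1-second samples
  zpool iostat rpool 1

  # terminal 2: raw ARC hit/miss counters (what arcstat.pl is based on)
  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses

  # terminal 3: repeat a read; near-zero pool reads while this runs
  # means the file is being served entirely from the ARC
  dd if=/var/tmp/blablabla of=/dev/null bs=128k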