Hmmm,

wondering about IMHO strange ZFS results ...

X4440: 4x6 2.8GHz cores (Opteron 8439 SE), 64 GB RAM
       6x Sun STK RAID INT V1.0 (Hitachi H103012SCSUN146G SAS)
       Nevada b124

Started with a simple test using zfs on c1t0d0s0: cd /var/tmp

(1) time sh -c 'mkfile 32g bla ; sync'
    0.16u 19.88s 5:04.15 6.5%
(2) time sh -c 'mkfile 32g blabla ; sync'
    0.13u 46.41s 5:22.65 14.4%
(3) time sh -c 'mkfile 32g blablabla ; sync'
    0.19u 26.88s 5:38.07 8.0%

chmod 644 b*
(4) time dd if=bla of=/dev/null bs=128k
    262144+0 records in
    262144+0 records out
    0.26u 25.34s 6:06.16 6.9%
(5) time dd if=blabla of=/dev/null bs=128k
    262144+0 records in
    262144+0 records out
    0.15u 26.67s 4:46.63 9.3%
(6) time dd if=blablabla of=/dev/null bs=128k
    262144+0 records in
    262144+0 records out
    0.10u 20.56s 0:20.68 99.9%

So 1-3 is more or less what I expected (~97..108 MB/s write).
However, 4-6 looks strange: 89, 114 and 1585 MB/s read!

Since the ARC size is ~55 +- 2 GB (at least arcstat.pl says so), I guess
(6) reads completely from memory. Hmm - maybe. However, I would expect
that when repeating 5 and 6, 'blablabla' gets evicted in favor of 'bla'
or 'blabla'. But the numbers say that 'blablabla' is kept in the cache,
since I get almost the same results as in the first run (and zpool
iostat/arcstat.pl show almost no activity at all for blablabla).
So is this a ZFS bug? Or is the OS doing some magic here?

2nd) I've never had a Sun STK RAID INT before. Actually my intention was
to create a zpool mirror of sd0 and sd1 for boot and logs, and a 2x2-way
zpool mirror with the 4 remaining disks. However, the controller does not
seem to support JBODs :( - which is also bad, since we can't simply move
those disks into another machine with a different controller without data
loss, because the controller seems to use its own on-disk format under
the hood. Also, the 256MB BBU cache seems to be a little small for the
ZIL, even if one knew how to configure it ...

So what would you recommend? Creating 2 appropriate STK INT arrays and
using both as single zpool devices, i.e. without ZFS mirror devs and 2nd
copies?

The intended workload is MySQL DBs + VBox images on the 4-disk mirror,
and logs and the OS on the 2-disk mirror; the machine should also act as
a Sun Ray server (user homes and additional apps come from another server
via NFS).

Any hints?

Regards,
jel.
-- 
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
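
[The MB/s figures quoted above are just 32 GiB divided by the wall-clock
times reported by time(1). As a quick sanity check of that arithmetic,
using the elapsed times from runs (1) and (6):]

  # 32 GiB = 32768 MiB; elapsed times taken from the 'time' output above
  echo "scale=1; 32768 / 304.15" | bc    # run (1): 5:04.15 -> ~107.7 MB/s
  echo "scale=1; 32768 / 20.68"  | bc    # run (6): 0:20.68 -> ~1584.5 MB/s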
jel+zfs at cs.uni-magdeburg.de said:
> 2nd) Never had a Sun STK RAID INT before. Actually my intention was to create
> a zpool mirror of sd0 and sd1 for boot and logs, and a 2x2-way zpool mirror
> with the 4 remaining disks. However, the controller seems not to support
> JBODs :( - which is also bad, since we can't simply put those disks into
> another machine with a different controller without data loss, because the
> controller seems to use its own format under the hood.

Yes, those Adaptec/STK internal RAID cards are annoying for use with ZFS.
You also cannot replace a failed disk without using the STK RAID software
to configure the new disk as a standalone volume (before "zpool replace").
Fortunately you probably don't need to boot into the BIOS-level utility;
I think you can use the Adaptec StorMan utilities from within the OS, if
you remembered to install them.

> Also the 256MB
> BBCache seems to be a little bit small for ZIL even if one would know, how to
> configure it ...

Unless you have an external (non-NV cached) pool on the same server, you
wouldn't gain anything from setting up a separate ZIL in this case. All
your internal drives have NV cache without doing anything special.

> So what would you recommend? Creating 2 appropriate STK INT arrays and using
> both as a single zpool device, i.e. without ZFS mirror devs and 2nd copies?

Here's what we did: Configure all internal disks as standalone volumes on
the RAID card. All those volumes have the battery-backed cache enabled.
The first two 146GB drives got sliced in two: the first half of each disk
became the boot/root mirror pool, and the 2nd half was used for a
separate-ZIL mirror, applied to an external SATA pool. Our remaining
internal drives were configured into a mirrored ZFS pool for database
transaction logs. No need for a separate ZIL there, since the internal
drives effectively have NV cache as far as ZFS is concerned.

Yes, the 256MB cache is small, but if it fills up, it is backed by the
10kRPM internal SAS drives, which should have decent latency when
compared to external SATA JBOD drives. And even this tiny NV cache makes
a huge difference when used on an NFS server:

    http://acc.ohsu.edu/~hakansom/j4400_bench.html

Regards,

Marion
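
[A rough sketch of that layout in zpool terms. The device, slice and pool
names below are made up for illustration (they are not from Marion's
actual setup), the slices would already have been laid out with
format(1M), and root pools are normally created by the installer, so only
the data-pool steps are shown:]

  # mirrored separate ZIL for an existing external SATA pool ("satapool"),
  # using the second slice of each of the two sliced internal drives
  zpool add satapool log mirror c1t0d0s1 c1t1d0s1

  # remaining four internal (NV-cached) drives as a mirrored pool
  # for database transaction logs
  zpool create dblogs mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0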
Jens Elkner wrote:
> Hmmm,
>
> wondering about IMHO strange ZFS results ...
>
> X4440: 4x6 2.8GHz cores (Opteron 8439 SE), 64 GB RAM
>        6x Sun STK RAID INT V1.0 (Hitachi H103012SCSUN146G SAS)
>        Nevada b124
>
> Started with a simple test using zfs on c1t0d0s0: cd /var/tmp
>
> (1) time sh -c 'mkfile 32g bla ; sync'
>     0.16u 19.88s 5:04.15 6.5%
> (2) time sh -c 'mkfile 32g blabla ; sync'
>     0.13u 46.41s 5:22.65 14.4%
> (3) time sh -c 'mkfile 32g blablabla ; sync'
>     0.19u 26.88s 5:38.07 8.0%
>
> chmod 644 b*
> (4) time dd if=bla of=/dev/null bs=128k
>     262144+0 records in
>     262144+0 records out
>     0.26u 25.34s 6:06.16 6.9%
> (5) time dd if=blabla of=/dev/null bs=128k
>     262144+0 records in
>     262144+0 records out
>     0.15u 26.67s 4:46.63 9.3%
> (6) time dd if=blablabla of=/dev/null bs=128k
>     262144+0 records in
>     262144+0 records out
>     0.10u 20.56s 0:20.68 99.9%
>
> So 1-3 is more or less expected (~97..108 MB/s write).
> However 4-6 looks strange: 89, 114 and 1585 MB/s read!
>
> Since the arc size is ~55+-2GB (at least arcstat.pl says so), I guess (6)
> reads from memory completely. Hmm - maybe.
> However, I would expect, that when repeating 5-6, 'blablabla' gets replaced
> by 'bla' or 'blabla'. But the numbers say, that 'blablabla' is kept in the
> cache, since I get almost the same results as in the first run (and zpool
> iostat/arcstat.pl show for the blablabla almost no activity at all).
> So is this a ZFS bug? Or does the OS some magic here?
>

IIRC, when ZFS detects sequential reads of a given file it stops caching
blocks for it. Because #6 was created last, all of its blocks were already
cached in the ARC; then, when reading #4 and #5, ZFS detected the
sequential reads and did not put that data into the cache, leaving the
last-written file entirely cached. For many workloads this is desirable
behaviour, but for many others it is not (like parsing large log files
with grep-like tools, which then never get cached ...).

-- 
Robert Milkowski
http://milek.blogspot.com
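
[One way to see that behaviour directly is to watch physical pool I/O and
the raw ARC counters while repeating one of the reads. The pool name
"rpool" and the file path below are placeholders; the tools are the same
ones already used in the thread (zpool iostat, arcstat.pl and the kstats
it reads):]

  # terminal 1: physical reads from the pool, 1-second samples
  zpool iostat rpool 1

  # terminal 2: raw ARC hit/miss counters (what arcstat.pl is based on)
  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses

  # terminal 3: repeat a read; near-zero pool reads while this runs
  # means the file is being served entirely from the ARC
  dd if=/var/tmp/blablabla of=/dev/null bs=128k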