The discussion is really old: writing many small files to an NFS-mounted ZFS
filesystem is slow without an SSD ZIL, due to the synchronous nature of the NFS
protocol itself. But there is something I don't really understand. My tests on
an old Opteron box with two small U160 SCSI arrays and a zpool of 4 mirrored
vdevs built from 146 GB disks show mostly idle disks when untarring an archive
with many small files over NFS. Any source package can be used for this test.
I'm on zpool version 22 (still SXCE b130; the client is OpenSolaris b130),
NFS mount options are all defaults, NFSD_SERVERS=128.
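For reference, the test boils down to this (hostname, mount point, and tarball
are just placeholder examples; any source package will do):

  # on the client, filesystem mounted with default options
  cd /mnt/ib1test
  gzcat /tmp/coreutils-8.4.tar.gz | ptime tar xf -

  # on the server, nfsd thread count
  grep NFSD_SERVERS /etc/default/nfs    # -> NFSD_SERVERS=128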
Configuration of the pool is like this:
zpool status ib1
  pool: ib1
 state: ONLINE
 scrub: scrub completed after 0h52m with 0 errors on Sat Jan 15 14:19:02 2011
config:

        NAME        STATE     READ WRITE CKSUM
        ib1         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
zpool iostat -v shows:

               capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
ib1          268G   276G      0    180      0   723K
  mirror    95.4G  40.6G      0     44      0   180K
    c1t4d0      -      -      0     44      0   180K
    c3t0d0      -      -      0     44      0   180K
  mirror    95.2G  40.8G      0     44      0   180K
    c1t6d0      -      -      0     44      0   180K
    c4t0d0      -      -      0     44      0   180K
  mirror    39.0G  97.0G      0     45      0   184K
    c3t3d0      -      -      0     45      0   184K
    c4t3d0      -      -      0     45      0   184K
  mirror    38.5G  97.5G      0     44      0   180K
    c3t4d0      -      -      0     44      0   180K
    c4t4d0      -      -      0     44      0   180K
----------  -----  -----  -----  -----  -----  -----
So each disk gets 40-50 IOPS, 180 ops for the whole pool (mirrored). Note that
these U320 SCSI disks should be able to handle about 150 IOPS each, so there's
no IOPS aggregation going on. The strange thing is the following iostat
-MindexC output:
                          extended device statistics         ---- errors ---
  r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0  14   0  14 c0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0  14   0  14 c0t0d0
  0.0  186.0    0.0    0.4  0.0  0.0    0.0    0.1   0   2   0   0   0   0 c1
  0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c1t4d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c1t5d0
  0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c1t6d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t0d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t1d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c2t2d0
  0.0  279.5    0.0    0.5  0.0  0.0    0.0    0.1   0   3   0   0   0   0 c3
  0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c3t0d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3t1d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c3t2d0
  0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c3t3d0
  0.0   93.5    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c3t4d0
  0.0  279.0    0.0    0.5  0.0  0.0    0.0    0.2   0   5   0   0   0   0 c4
  0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.3   0   3   0   0   0   0 c4t0d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c4t2d0
  0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c4t4d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c4t1d0
  0.0   93.0    0.0    0.2  0.0  0.0    0.0    0.1   0   1   0   0   0   0 c4t3d0
Service times for the involved disks are around 0.1-0.3 ms; I think this
reflects the mostly sequential write pattern of ZFS. The disks are at most 3%
busy. With synchronous writes I'd expect 100% busy disks. And when reading or
writing locally the disks really do get busy: about 50 MB/s per disk, due to
the 160 MB/s limit per SCSI bus (there are two U160 channels with three disks
each and one channel with two disks).
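To rule the disks in or out, one quick (and strictly test-only) experiment is
to disable the ZIL globally; b130 still predates the per-dataset sync
property, so the old zil_disable tunable is the way to do it:

  echo zil_disable/W0t1 | mdb -kw    # ZIL off, for testing only
  # rerun the untar ...
  echo zil_disable/W0t0 | mdb -kw    # ZIL back on

If the untar rate jumps with the ZIL off, the limit clearly is the serialized
sync round trips and not the spindles.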
Richard Elling's zilstat gives:
   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
      9552       9552        9552     671744     671744      671744   164    164       0       0
     10192      10192       10192     724992     724992      724992   177    177       0       0
      9568       9568        9568     679936     679936      679936   166    166       0       0
     11712      11712       11712     823296     823296      823296   201    201       0       0
     10784      10784       10784     765952     765952      765952   187    187       0       0
     10024      10024       10024     708608     708608      708608   173    173       0       0
So at most about 200 ZIL ops/s, all of them < 4 kB. As said, the disks aren't
busy during this test.
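As a cross-check of the zilstat numbers, the ZIL commit rate can also be
counted directly (a minimal DTrace sketch; zil_commit() is the kernel
function the sync write paths funnel through):

  dtrace -qn 'fbt::zil_commit:entry { @c = count(); }
    tick-1sec { printa("zil_commit/s: %@d\n", @c); clear(@c); }'

And the arithmetic is at least self-consistent: ~93 writes/s per disk at
~0.2 ms service time is roughly 2% utilization, right in the 1-3% busy range
iostat reports. The disks look idle because the requests arrive strictly one
after another, not because they are saturated.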
The test ZFS filesystem is configured with atime off. logbias hardly matters;
with logbias=latency the IOPS rate is a little lower than with
logbias=throughput.
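For completeness, the comparison was simply this ('ib1/test' stands in for
the real dataset name):

  zfs get atime,logbias ib1/test
  zfs set logbias=throughput ib1/test    # untar, note the rate
  zfs set logbias=latency ib1/test       # default; untar again

Without a separate log device both settings end up on the same spindles
anyway, which presumably is why the difference is so small.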
Attached are some bonnie++ results to show that all disks and the whole pool
are quite healthy. I get > 1000 random reads/s locally and still nearly 900
reads/s via NFS. For large files I easily get Gbit wirespeed (105 MB/s read)
with NFS. And for the random reads of a bonnie++ or iozone run the disks
really are 80-100% busy. Only for small files does the array sit almost idle,
although it could do way more. I have seen this on different Solaris versions,
not only on this test system. Is there any explanation for this behaviour?
Thanks,
Michael
bonnie++ results, local:
Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ibmr10          16G          108972  25 89923  21          263540  26  1074   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 30359  99 +++++ +++ +++++ +++ 24836  99 +++++ +++ +++++ +++
ibmr10,16G,,,108972,25,89923,21,,,263540,26,1073.5,3,16,30359,99,+++++,+++,+++++,+++,24836,99,+++++,+++,+++++,+++
bonnie++ results, via NFS:
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nfsibmr10       16G           50022  11 42524  14          105335  18 884.8  20
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   152   3 +++++ +++   182   1   151   3 +++++ +++   183   1
nfsibmr10,16G,,,50022,11,42524,14,,,105335,18,884.8,20,16,152,3,+++++,+++,182,1,151,3,+++++,+++,183,1