Dale Ghent
2006-Mar-23 22:10 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
I'm seeing some pretty pitiful performance using ZFS on an NFS server, with a ZFS volume exported (only with rw=host.foo.com,root=host.foo.com opts) and mounted on a Linux host running kernel 2.4.31. The Linux kernel I'm working with is limited in that I can only do NFSv2 mounts... regardless of that aspect, I'm sure something's amiss.

I mounted the ZFS-based NFS share on the Linux host and ran tests with dd:

---------------
[root at hercules]/>mount -o rw,vers=2,hard,intr,rsize=8192,wsize=8192 ds2.rs:/ds-store/rs/test /umbc/test
[root at hercules]/>time dd if=/dev/zero of=/umbc/test/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    170m51.619s
user    0m0.060s
sys     0m5.720s
---------------

Ooof. 170 minutes to write a 256MB file. Sanity-check time; let's try the same thing on a UFS-backed export from the same server:

---------------
[root at hercules]/>mount -o rw,vers=2,hard,intr,rsize=8192,wsize=8192 ds2.rs:/ds-backup /umbc/test
[root at hercules]/>time dd if=/dev/zero of=/umbc/test/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m22.989s
user    0m0.040s
sys     0m3.880s
---------------

22 seconds. That's more like it. Something must be wrong with ZFS<->NFS, obviously. Is this a known problem? I see some NFS-related ZFS bugs in the bug tracker, but none of them seem to complain about such egregious slowness, and most are marked as closed anyway.

Here are vitals on the NFS server:

[root at ds2]/s#uname -a
SunOS ds2.rs 5.11 snv_35 i86pc i386 i86pc

[root at ds2]/#zpool list
NAME       SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
ds-store  6.31T  1.42M  6.31T   0%  ONLINE  -

[root at ds2]/#zpool status
  pool: ds-store
 state: ONLINE
 scrub: none requested
config:
        NAME                                       STATE   READ WRITE CKSUM
        ds-store                                   ONLINE     0     0     0
          raidz                                    ONLINE     0     0     0
            c6t60003930000153A501000000D52ADA12d0  ONLINE     0     0     0
            c6t60003930000153A5020000006A95D52Ad0  ONLINE     0     0     0
            c6t60003930000153A503000000B54A6A95d0  ONLINE     0     0     0
            c6t60003930000153A5040000005AA5B54Ad0  ONLINE     0     0     0
            c6t60003930000153A505000000AD525AA5d0  ONLINE     0     0     0
            c6t60003930000153A50600000056A9AD52d0  ONLINE     0     0     0
            c6t60003930000153A507000000AB5456A9d0  ONLINE     0     0     0
            c6t600039300001546301000000D52AC2B9d0  ONLINE     0     0     0
            c6t6000393000015463020000006A95D52Ad0  ONLINE     0     0     0
            c6t600039300001546303000000B54A6A95d0  ONLINE     0     0     0
            c6t6000393000015463040000005AA5B54Ad0  ONLINE     0     0     0
            c6t600039300001546305000000AD525AA5d0  ONLINE     0     0     0
            c6t60003930000154630600000056A9AD52d0  ONLINE     0     0     0
            c6t600039300001546307000000AB5456A9d0  ONLINE     0     0     0

[root at ds2]/#zfs get all ds-store/rs/test
NAME              PROPERTY       VALUE                      SOURCE
ds-store/rs/test  type           filesystem                 -
ds-store/rs/test  creation       Thu Mar 23 11:23 2006      -
ds-store/rs/test  used           149K                       -
ds-store/rs/test  available      35.0G                      -
ds-store/rs/test  referenced     149K                       -
ds-store/rs/test  compressratio  1.00x                      -
ds-store/rs/test  mounted        yes                        -
ds-store/rs/test  quota          35G                        local
ds-store/rs/test  reservation    none                       default
ds-store/rs/test  recordsize     128K                       default
ds-store/rs/test  mountpoint     /ds-store/rs/test          default
ds-store/rs/test  sharenfs       rw=hercules,root=hercules  local
ds-store/rs/test  checksum       on                         default
ds-store/rs/test  compression    on                         inherited from ds-store
ds-store/rs/test  atime          on                         default
ds-store/rs/test  devices        on                         default
ds-store/rs/test  exec           on                         default
ds-store/rs/test  setuid         on                         default
ds-store/rs/test  readonly       off                        default
ds-store/rs/test  zoned          off                        default
ds-store/rs/test  snapdir        visible                    default
ds-store/rs/test  aclmode        groupmask                  default
ds-store/rs/test  aclinherit     secure                     default

The NFS-exported UFS file system on this same server:

[root at ds2]/#metastat -c d100
d100  s  5.5TB  /dev/dsk/c6t60003930000153DA01000000D52AB182d0s0 /dev/dsk/c6t60003930000153EB01000000D52AC9CCd0s0

[root at ds2]/#mount | grep ds-backup
/ds-backup on /dev/md/dsk/d100 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1540064 on Thu Mar 23 00:10:34 2006
Noel Dellofano
2006-Mar-23 22:44 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Sounds like something isn't quite happy. Could you grab a snoop trace while you're doing the dd? On the server, become root and do:

#snoop -o snoop.out <clientname>

thanks,
Noel :-)

************************************************************************
"Question all the answers"
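(A minimal sketch of that capture, with assumed names: the client hostname below is the hercules box from the earlier prompts, and the output path is arbitrary. Reading the file back with snoop -t d prints per-packet delta times, which makes a long gap between a WRITE call and its reply easy to spot.)

# On the NFS server, capture traffic to/from the client while the dd runs:
snoop -o /var/tmp/zfs-nfs.snoop hercules

# Afterwards, summarize the capture with delta timestamps:
snoop -i /var/tmp/zfs-nfs.snoop -t d | grep NFS | more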
eric kustarz
2006-Mar-23 22:48 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Dale Ghent wrote:

> I'm seeing some pretty pitiful performance using ZFS on an NFS server,
> with a ZFS volume exported (only with rw=host.foo.com,root=host.foo.com
> opts) and mounted on a Linux host running kernel 2.4.31. The Linux kernel
> I'm working with is limited in that I can only do NFSv2 mounts...
> regardless of that aspect, I'm sure something's amiss.
>
> [root at hercules]/>mount -o rw,vers=2,hard,intr,rsize=8192,wsize=8192 ds2.rs:/ds-store/rs/test /umbc/test
> [root at hercules]/>time dd if=/dev/zero of=/umbc/test/testfile1 bs=16k count=16384
> 16384+0 records in
> 16384+0 records out
>
> real    170m51.619s
> user    0m0.060s
> sys     0m5.720s
>
> Ooof. 170 minutes to write a 256MB file.

Hmmm, i just tried this on solaris vs. solaris and got it to complete in about a minute (with the server as 32 or 64 bit) over a dirty network - ufs took about 43 seconds:

fsh-mullet# mount -o vers=2 hodur:/z /mnt
fsh-mullet# /bin/time dd if=/dev/zero of=/mnt/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real        1:06.6
user          0.0
sys           1.7
fsh-mullet#
fsh-mullet# mount -o vers=2 hodur:/ufs /mnt
fsh-mullet# /bin/time dd if=/dev/zero of=/mnt/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real         43.1
user          0.0
sys           1.7
fsh-mullet#

The server is a 2way opteron with 3.5G of memory running 3/21 nevada bits. The client is a 2way sparc with 2G of memory running 3/22 nevada bits.

and you have a clean network? what type of machines do you have (and are they running 32 or 64 bit)? was the server doing anything else?

i assume it's reproducible - if so, a snoop trace is always good... plus you can dtrace zfs to see what's slow.

From having worked on NFS, i can say that running 2.4 linux NFS bits is suspect, and lots of improvements have been made in the 2.6 tree. Neil just putback some more improvements to build 37 for ZFS w/ NFS, but it shouldn't matter here... wonder what's different in your environment.

eric
Dale Ghent
2006-Mar-23 23:05 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
I did some more stats gathering against the same NFS server, this time with a Solaris 10u1 box as the NFS client:

NFSv2 mount of ZFS volume, writing 256MB file:
--------------
[root at stats]/>mount -o vers=2 ds2.rs:/ds-store/stats/mail /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
^C2617+0 records in
2617+0 records out

real    18m55.739s
user    0m0.005s
sys     0m0.146s
--------------

(Because my time is short this evening, I had to interrupt that test, but as you can see, only 2617 16k blocks (41,872k!) were written over 19 minutes.)

NFSv3 mount of ZFS volume, writing 256MB file:
--------------
[root at stats]/>mount -o vers=3 ds2.rs:/ds-store/stats/mail /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m27.442s
user    0m0.016s
sys     0m0.933s
--------------

NFSv2 mount of UFS-backed export, writing 256MB file:
--------------
[root at stats]/>mount -o vers=2 ds2.rs:/ds-backup /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m25.909s
user    0m0.027s
sys     0m1.161s
--------------

NFSv3 mount of same UFS-backed export, writing 256MB file:
--------------
[root at stats]/>mount -o vers=3 ds2.rs:/ds-backup /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m24.916s
user    0m0.023s
sys     0m0.994s
--------------

So with this test, the differences between the NFSv2 and v3 protocols are as expected, but over NFSv2 the gap between ZFS and UFS as the backing store on this Solaris 10 client is pretty much the same thing I saw on the aforementioned Linux client.

/dale
Dale Ghent
2006-Mar-23 23:42 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Here are four snoops, each of ~3000 packets: one set taken against the Linux client and the Solaris 10 client while each mounted a ZFS volume via NFSv2, and the other set of the same two machines writing to the UFS-backed NFS share. These captures start from the first write:

Captures of Linux and Solaris NFS clients writing to a ZFS-backed share:
http://userpages.umbc.edu/~daleg/nfsv2-zfs-linux-client.out.bz2
http://userpages.umbc.edu/~daleg/nfsv2-zfs-solaris10-client.out.bz2

Captures of Linux and Solaris NFS clients writing to a UFS-backed share:
http://userpages.umbc.edu/~daleg/nfsv2-ufs-linux-client.out.bz2
http://userpages.umbc.edu/~daleg/nfsv2-ufs-solaris10-client.out.bz2

HTH
/dale
Dale Ghent
2006-Mar-23 23:51 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
> and you have a clean network? what type of machines do you have (and are
> they running 32 or 64 bit)? was the server doing anything else?

The NFS server in my case is an X4100 running b35 with 4GB of RAM and two 7TB Apple Xserve RAIDs attached over a SAN. One Xserve RAID is RAID5 and UFS; the other is configured as a JBOD and has its 14 disks in one ZFS pool with raidz. The Linux client is an 8x2GHz Xeon box with 24GB of RAM. The Solaris client is another X4100 with 4GB RAM running s10u1.

The network's pretty clean. Both NFS clients are on the same switch as the NFS server.

> i assume it's reproducible - if so, a snoop trace is always good... plus
> you can dtrace zfs to see what's slow.

I'll give that a shot.

> From having worked on NFS, i can say that running 2.4 linux NFS bits is
> suspect, and lots of improvements have been made in the 2.6 tree.

Yeah, I was suspicious of Linux 2.4 as well, but then for grins I ran the same test against the UFS-backed NFS share from the same NFS server, and those results were more in line with my expectations. To further isolate the variables, I tried the same tests from a Solaris host (see my previous posts in this thread) and got the same results... so while I agree that the Linux NFS implementation has never been stellar, I don't really think it's at fault in this case.

/dale
Jeff Bonwick
2006-Mar-24 02:03 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
> Ooof. 170 minutes to write a 256MB file.

Ooof indeed. Just to eliminate a variable, what kind of performance are you getting locally?

BTW, this is a particularly interesting case because you've enabled compression and you're writing a file full of zeroes. When we detect all zeroes, we simply don't allocate a block, so in the local case there should be no disk I/O at all (except a little trickle of mtime updates on the znode for that file). So the all-zeroes time would still be interesting, but I'd also be curious what you get with non-zero data.

When you use NFS, it will periodically fsync() the file, which forces us to write to the intent log. It's possible that lots of synchronous writes to a wide RAID-Z are slowing things down, although even then a meg a minute is hard to explain -- unless, perhaps, these drives are really slow at the SYNCHRONIZE_CACHE SCSI command. You could measure this with dtrace by timing zil_flush_vdevs() -- if it's taking more than a couple of milliseconds (wall-clock time), that's bad.

Finally, a best-practices note: 14 disks is on the wide side for good RAID-Z performance. If you get a chance to reconfigure the pool, I'd suggest keeping it in the single-digit range. In other words, instead of this:

        zpool create ds-store raidz a b c d e f g h i j k l m n

something more like this:

        zpool create ds-store raidz a b c d e f g raidz h i j k l m n

This will give you two seven-wide RAID-Z groups, which should perform a bit better, especially when doing many small I/Os. (This is a matter of percentages, though; it doesn't explain the meg-a-minute behavior you're seeing.)

Jeff
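(A rough sketch of the measurement Jeff describes; the fbt probe module and function names below are assumptions based on his mention of zil_flush_vdevs() and may need checking against `dtrace -l` on this build. Run it on the server while the NFS dd test is in flight; quantize buckets well above a few milliseconds would point at slow cache flushes.)

dtrace -n '
  fbt:zfs:zil_flush_vdevs:entry  { self->ts = timestamp; }
  fbt:zfs:zil_flush_vdevs:return /self->ts/ {
          @["zil_flush_vdevs wall-clock (ns)"] = quantize(timestamp - self->ts);
          self->ts = 0;
  }'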
Bill Sommerfeld
2006-Mar-24 03:00 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
On Thu, 2006-03-23 at 21:03, Jeff Bonwick wrote:

> Finally, a best-practices note: 14 disks is on the wide side for
> good RAID-Z performance. If you get a chance to reconfigure the
> pool, I'd suggest keeping it in the single-digit range. In other
> words, instead of this:
>
>         zpool create ds-store raidz a b c d e f g h i j k l m n
>
> something more like this:
>
>         zpool create ds-store raidz a b c d e f g raidz h i j k l m n

Overly wide raidz groups seem to be an unfenced hole that people new to ZFS fall into on a regular basis.

The man page warns against this but that doesn't seem to be sufficient.

Given that zfs has relatively few such traps, perhaps large raidz groups ought to be implicitly split up absent a "Yes, I want to be stupid" flag.

- Bill
Patrick Bachmann
2006-Mar-24 04:17 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Hey Bill,

Bill Sommerfeld wrote:

> Overly wide raidz groups seem to be an unfenced hole that people new to
> ZFS fall into on a regular basis.
>
> The man page warns against this but that doesn't seem to be sufficient.
>
> Given that zfs has relatively few such traps, perhaps large raidz groups
> ought to be implicitly split up absent a "Yes, I want to be stupid"
> flag.

IMHO it is sufficient to just document this best practice. Neither the "ZFS Administration Guide" nor the ZFS FAQ mentions it. When I try to get to know a new technology and its "traps", I normally check the FAQ first, so appending an answer there would really do the job.

Greetings,

Patrick
Dale Ghent
2006-Mar-24 04:40 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Thanks for the suggestions, Jeff.

Since my last post on this topic, I came home, had a good dinner, and sat back down to look at this issue. It seems that NFSv3 on ZFS is all well and good; it's v2 that's the problem.

So before destroying my 14-disk zpool and remaking it per your suggestions, I ran some more dd tests using /dev/urandom as the source instead of /dev/zero.

Solaris 10u1 client NFSv2 mounting a ZFS-backed share, writing a 256MB file with data sourced from /dev/urandom and then /dev/zero:
-------------
[root at stats]/>time dd if=/dev/urandom of=/stats/testfile1 bs=16k count=16384
0+16384 records in
0+16384 records out

real    4m28.313s
user    0m0.007s
sys     0m0.850s

[root at stats]/>time dd if=/dev/zero of=/stats/testfile2 bs=16k count=16384
^C13877+0 records in
13877+0 records out

real    59m14.522s
user    0m0.025s
sys     0m0.808s
-------------

Okay, so writing a file from /dev/urandom took much less time than writing a file full of nulls. Immediately after quitting that last dd after an hour, I wrote another file from urandom again, just to see what would happen:

-------------
[root at stats]/>time dd if=/dev/urandom of=/stats/testfile3 bs=16k count=16384
0+16384 records in
0+16384 records out

real    5m0.644s
user    0m0.007s
sys     0m0.837s
-------------

Looks just like the first time. So now it looks like NFSv2+ZFS+nulls isn't having it.

I leave tomorrow p.m. for a week overseas, so I doubt I'll have time to remake the zpool as prescribed and re-run the tests, but I definitely will after I return.

/dale
Richard Elling
2006-Mar-24 05:44 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
> Bill Sommerfeld wrote:
> > Overly wide raidz groups seem to be an unfenced hole that people new to
> > ZFS fall into on a regular basis.
> >
> > The man page warns against this but that doesn't seem to be sufficient.
> >
> > Given that zfs has relatively few such traps, perhaps large raidz groups
> > ought to be implicitly split up absent a "Yes, I want to be stupid"
> > flag.
>
> IMHO it is sufficient to just document this best practice.

Disagree. People don't read documentation. It makes good sense to prevent poor configurations, absent an intentional override [*].

However, at this time I don't think we know where the boundaries are. The interplay of performance, RAS, and cost is not obvious. In particular, performance is not fully characterized and the cost drops continuously. More work needed here...

[*] for example, prior to ZFS it was considered a poor practice to mirror onto the same disk. With ZFS, it can make sense in that it protects against certain failure modes which other mirroring technologies do not cover. But by default, we'd probably prefer to avoid mirroring onto the same disk unless the user explicitly desires it.

 -- richard
Patrick Bachmann
2006-Mar-24 07:40 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Hey Richard,

Richard Elling wrote:

> Disagree. People don't read documentation. It makes good
> sense to prevent poor configurations, absent an intentional
> override [*].

Ok, you've got a point there. People really don't read docs. And you're also right that it makes sense to prevent poor configurations.

> However, at this time I don't think we know where the boundaries
> are. The interplay of performance, RAS, and cost is not obvious.
> In particular, performance is not fully characterized and the cost
> drops continuously. More work needed here...

And until these kinds of configurations can and will be handled implicitly, it would be great to have this recommended practice documented at least in the ZFS FAQ. To be honest, I read right over the one subordinate clause saying that the recommended number of devices in a raidz is between three and nine, and there seem to be others like me who get really excited about ZFS but miss that kind of info when skimming the manual. A bullet in the FAQ for this would be really helpful and would guide folks toward getting the best out of ZFS. :)

Greetings,

Patrick
Spencer Shepler
2006-Mar-24 08:21 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
On Thu, Jeff Bonwick wrote:

> > Ooof. 170 minutes to write a 256MB file.
>
> Ooof indeed. Just to eliminate a variable, what kind of performance
> are you getting locally?
>
> BTW, this is a particularly interesting case because you've enabled
> compression and you're writing a file full of zeroes. When we detect
> all zeroes, we simply don't allocate a block, so in the local case
> there should be no disk I/O at all (except a little trickle of mtime
> updates on the znode for that file). So the all-zeroes time would
> still be interesting, but I'd also be curious what you get with
> non-zero data.
>
> When you use NFS, it will periodically fsync() the file, which forces
> us to write to the intent log. It's possible that lots of synchronous

Remember that in the case of NFSv2 (max write size of 8k), each protocol WRITE must be synchronously written on disk. The Solaris server attempts to "collect" a set of write requests such that a larger write/sync is done. So, the server will do a VOP_WRITE followed by a VOP_PUTPAGE for the collected write range.

Spencer
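(To put rough, illustrative numbers on that; the per-commit latencies here are assumptions, not measurements from this thread. A 256MB file over NFSv2 means 256MB / 8KB = 32,768 WRITE RPCs, each of which must reach stable storage before the server replies. At 10 ms per stable commit that is already about 5.5 minutes of pure sync latency; at around 300 ms per commit - say, a very slow SYNCHRONIZE CACHE on the array - it works out to roughly 2.7 hours, which is the neighborhood of the 170-minute ZFS result above.)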
Casper.Dik at Sun.COM
2006-Mar-24 09:01 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
> Thanks for the suggestions, Jeff.
>
> Since my last post on this topic, I came home, had a good dinner, and sat
> back down to look at this issue. It seems that NFSv3 on ZFS is all well and
> good; it's v2 that's the problem.
>
> So before destroying my 14-disk zpool and remaking it per your suggestions,
> I ran some more dd tests using /dev/urandom as the source instead of /dev/zero.
>
> Solaris 10u1 client NFSv2 mounting a ZFS-backed share, writing a 256MB file
> with data sourced from /dev/urandom and then /dev/zero:
> -------------
> [root at stats]/>time dd if=/dev/urandom of=/stats/testfile1 bs=16k count=16384
> 0+16384 records in
> 0+16384 records out

Note: these are incomplete records!! Use obs=16k

Casper
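(A hedged sketch of one way to act on that; the local staging path is an assumption and the NFS path matches the earlier tests. The "0+16384" counts mean every read from /dev/urandom came back short, so less than 256MB was actually written, which skews the comparison with the zero-filled runs.)

# Stage non-zero data locally first; obs=16k reblocks the short reads from
# /dev/urandom into full 16k output records, but count= still limits the
# number of *input* reads, so check the resulting size.
dd if=/dev/urandom of=/var/tmp/random256m ibs=16k obs=16k count=16384
ls -l /var/tmp/random256m    # if short reads shrank it, raise count until it reaches 256MB

# Then time the NFS write against a file of known size and contents:
time dd if=/var/tmp/random256m of=/stats/testfile1 bs=16k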
Robert Milkowski
2006-Mar-24 11:05 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Hello Jeff,

Friday, March 24, 2006, 3:03:09 AM, you wrote:

JB> Finally, a best-practices note: 14 disks is on the wide side for
JB> good RAID-Z performance. If you get a chance to reconfigure the
JB> pool, I'd suggest keeping it in the single-digit range. In other
JB> words, instead of this:

JB>         zpool create ds-store raidz a b c d e f g h i j k l m n

JB> something more like this:

JB>         zpool create ds-store raidz a b c d e f g raidz h i j k l m n

JB> This will give you two seven-wide RAID-Z groups, which should
JB> perform a bit better, especially when doing many small I/Os.
JB> (This is a matter of percentages, though; it doesn't explain
JB> the meg-a-minute behavior you're seeing.)

And why is that? I can see that such a config could give better tolerance of disk failures, but why better performance?

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Bev Crair
2006-Mar-24 15:48 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Folks,

There's been a lot of discussion about elements for an FAQ for ZFS. Sun can certainly provide one. IMHO, though, it would be really great for members of this community to add to the community FAQ already available on OpenSolaris (http://opensolaris.org/os/community/zfs/faq/). Sun could provide links from our BigAdmin website (http://www.sun.com/bigadmin/home/index.html) to the ZFS community on OpenSolaris.

If folks would share the list of questions (beyond the ones that are already there), and even answers, we'll get things pulled together and added to the FAQ.

Thanks,
Bev Crair
Director, Solaris KISS

Patrick Bachmann wrote:

> A bullet in the FAQ for this would be really helpful and would guide
> folks toward getting the best out of ZFS. :)
eric kustarz
2006-Mar-24 17:41 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Dale Ghent wrote:

> Here are four snoops, each of ~3000 packets: one set taken against the Linux
> client and the Solaris 10 client while each mounted a ZFS volume via NFSv2,
> and the other set of the same two machines writing to the UFS-backed NFS
> share. These captures start from the first write:
>
> Captures of Linux and Solaris NFS clients writing to a ZFS-backed share:
> http://userpages.umbc.edu/~daleg/nfsv2-zfs-linux-client.out.bz2
> http://userpages.umbc.edu/~daleg/nfsv2-zfs-solaris10-client.out.bz2
>
> Captures of Linux and Solaris NFS clients writing to a UFS-backed share:
> http://userpages.umbc.edu/~daleg/nfsv2-ufs-linux-client.out.bz2
> http://userpages.umbc.edu/~daleg/nfsv2-ufs-solaris10-client.out.bz2

So, somewhat expected (after seeing your results): with zfs you'll see delays in the write replies:

winter1% snoop -i nfsv2-zfs-solaris10-client.out -p2915,2925
2915   0.00000  130.85.24.10 -> 130.85.24.9   TCP D=1000 S=2049 Ack=859052744 Seq=3064684420 Len=0 Win=48180
2916   0.00018  130.85.24.9 -> 130.85.24.10   TCP D=2049 S=1000 Ack=3064684420 Seq=859052744 Len=1460 Win=49640
2917   0.00000  130.85.24.9 -> 130.85.24.10   TCP D=2049 S=1000 Push Ack=3064684420 Seq=859054204 Len=1068 Win=49640
2918   0.00004  130.85.24.10 -> 130.85.24.9   TCP D=1000 S=2049 Ack=859055272 Seq=3064684420 Len=0 Win=49640
2919   2.30632  130.85.24.10 -> 130.85.24.9   RPC R XID=4156007980 Success
2920   0.00003  130.85.24.10 -> 130.85.24.9   RPC R XID=4172785196 Success
2921   0.00000  130.85.24.10 -> 130.85.24.9   RPC R XID=4206339628 Success
2922   0.00000  130.85.24.10 -> 130.85.24.9   RPC R XID=4189562412 Success

Notice that the reply in packet 2919 came over 2 seconds later. You'll see the same thing if the client is linux (using UDP) or solaris (using TCP). With ufs, there is no long delay.

i can't reproduce this running snv_35, compression on, and using a 4 disk raidz - but something is amiss in your setup. Your storage setup looks ok - i guess it would be nice if you could switch the Xserve RAIDs to see if that's the problem, but that's probably not possible.

dtracing to see how long zfs_write() / zil_commit() / zil_flush_vdevs() / ldi_strategy() are taking would be good.

eric
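(A minimal sketch of that tracing; the probe module names are assumptions - zfs for the filesystem/ZIL functions, genunix for ldi_strategy() - so verify them with `dtrace -l | grep <function>` before relying on the output. It quantizes wall-clock time per call so whichever stage is eating the seconds stands out.)

dtrace -n '
  fbt:zfs:zfs_write:entry, fbt:zfs:zil_commit:entry,
  fbt:zfs:zil_flush_vdevs:entry, fbt:genunix:ldi_strategy:entry
  {
          self->ts[probefunc] = timestamp;
  }

  fbt:zfs:zfs_write:return, fbt:zfs:zil_commit:return,
  fbt:zfs:zil_flush_vdevs:return, fbt:genunix:ldi_strategy:return
  /self->ts[probefunc]/
  {
          @[probefunc] = quantize(timestamp - self->ts[probefunc]);
          self->ts[probefunc] = 0;
  }'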
Cindy Swearingen
2006-Mar-25 00:30 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Patrick,

Thanks for the doc feedback. I've drafted some information about the raidz performance issue in the current ZFS admin guide. We'll continue to collect best-practices information over the next month or so for the docs, FAQs, etc.

Jeff, the ZFS team, and this forum will have much to contribute, but we're still in the gathering stages. :-)

Cindy

Patrick Bachmann wrote:

> IMHO it is sufficient to just document this best practice. Neither the
> "ZFS Administration Guide" nor the ZFS FAQ mentions it. When I try to get
> to know a new technology and its "traps", I normally check the FAQ first,
> so appending an answer there would really do the job.