Dale Ghent
2006-Mar-23 22:10 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
I'm seeing some pretty pitiful performance using ZFS on an NFS server, with a ZFS volume exported (only with rw=host.foo.com,root=host.foo.com opts) and mounted on a Linux host running kernel 2.4.31. The Linux kernel I'm working with is limited in that I can only do NFSv2 mounts... regardless of that aspect, I'm sure something's amiss.

I mounted the ZFS-based NFS share on the Linux host and ran tests with dd:

---------------
[root at hercules]/>mount -o rw,vers=2,hard,intr,rsize=8192,wsize=8192 ds2.rs:/ds-store/rs/test /umbc/test
[root at hercules]/>time dd if=/dev/zero of=/umbc/test/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    170m51.619s
user    0m0.060s
sys     0m5.720s
---------------

Ooof. 170 minutes to write a 256MB file. Sanity-check time; let's try the same thing on a UFS-backed export from the same server:

---------------
[root at hercules]/>mount -o rw,vers=2,hard,intr,rsize=8192,wsize=8192 ds2.rs:/ds-backup /umbc/test
[root at hercules]/>time dd if=/dev/zero of=/umbc/test/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m22.989s
user    0m0.040s
sys     0m3.880s
---------------

22 seconds. That's more like it. Something must be wrong with ZFS<->NFS, obviously. Is this a known problem? I see some NFS-related ZFS bugs in the bug tracker, but none of them seem to complain about such egregious slowness, and most are marked as closed anyway.

Here are vitals on the NFS server:

[root at ds2]/s#uname -a
SunOS ds2.rs 5.11 snv_35 i86pc i386 i86pc

[root at ds2]/#zpool list
NAME       SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
ds-store  6.31T  1.42M  6.31T   0%  ONLINE  -

[root at ds2]/#zpool status
  pool: ds-store
 state: ONLINE
 scrub: none requested
config:
        NAME                                       STATE   READ WRITE CKSUM
        ds-store                                   ONLINE     0     0     0
          raidz                                    ONLINE     0     0     0
            c6t60003930000153A501000000D52ADA12d0  ONLINE     0     0     0
            c6t60003930000153A5020000006A95D52Ad0  ONLINE     0     0     0
            c6t60003930000153A503000000B54A6A95d0  ONLINE     0     0     0
            c6t60003930000153A5040000005AA5B54Ad0  ONLINE     0     0     0
            c6t60003930000153A505000000AD525AA5d0  ONLINE     0     0     0
            c6t60003930000153A50600000056A9AD52d0  ONLINE     0     0     0
            c6t60003930000153A507000000AB5456A9d0  ONLINE     0     0     0
            c6t600039300001546301000000D52AC2B9d0  ONLINE     0     0     0
            c6t6000393000015463020000006A95D52Ad0  ONLINE     0     0     0
            c6t600039300001546303000000B54A6A95d0  ONLINE     0     0     0
            c6t6000393000015463040000005AA5B54Ad0  ONLINE     0     0     0
            c6t600039300001546305000000AD525AA5d0  ONLINE     0     0     0
            c6t60003930000154630600000056A9AD52d0  ONLINE     0     0     0
            c6t600039300001546307000000AB5456A9d0  ONLINE     0     0     0

[root at ds2]/#zfs get all ds-store/rs/test
NAME              PROPERTY       VALUE                      SOURCE
ds-store/rs/test  type           filesystem                 -
ds-store/rs/test  creation       Thu Mar 23 11:23 2006      -
ds-store/rs/test  used           149K                       -
ds-store/rs/test  available      35.0G                      -
ds-store/rs/test  referenced     149K                       -
ds-store/rs/test  compressratio  1.00x                      -
ds-store/rs/test  mounted        yes                        -
ds-store/rs/test  quota          35G                        local
ds-store/rs/test  reservation    none                       default
ds-store/rs/test  recordsize     128K                       default
ds-store/rs/test  mountpoint     /ds-store/rs/test          default
ds-store/rs/test  sharenfs       rw=hercules,root=hercules  local
ds-store/rs/test  checksum       on                         default
ds-store/rs/test  compression    on                         inherited from ds-store
ds-store/rs/test  atime          on                         default
ds-store/rs/test  devices        on                         default
ds-store/rs/test  exec           on                         default
ds-store/rs/test  setuid         on                         default
ds-store/rs/test  readonly       off                        default
ds-store/rs/test  zoned          off                        default
ds-store/rs/test  snapdir        visible                    default
ds-store/rs/test  aclmode        groupmask                  default
ds-store/rs/test  aclinherit     secure                     default

The NFS-exported UFS file system on this same server:

[root at ds2]/#metastat -c d100
d100  s  5.5TB  /dev/dsk/c6t60003930000153DA01000000D52AB182d0s0 /dev/dsk/c6t60003930000153EB01000000D52AC9CCd0s0

[root at ds2]/#mount | grep ds-backup
/ds-backup on /dev/md/dsk/d100 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=1540064 on Thu Mar 23 00:10:34 2006
Noel Dellofano
2006-Mar-23 22:44 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Sounds like something isn't quite happy. Could you grab a snoop trace while you're doing the dd? On the server, become root and do:

#snoop -o snoop.out <clientname>

thanks,
Noel :-)

************************************************************************
"Question all the answers"
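(A minimal sketch of that capture, with assumed names: the client hostname below is the hercules box from the earlier prompts, and the output path is arbitrary. Reading the file back with snoop -t d prints per-packet delta times, which makes a long gap between a WRITE call and its reply easy to spot.)

# On the NFS server, capture traffic to/from the client while the dd runs:
snoop -o /var/tmp/zfs-nfs.snoop hercules

# Afterwards, summarize the capture with delta timestamps:
snoop -i /var/tmp/zfs-nfs.snoop -t d | grep NFS | more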
eric kustarz
2006-Mar-23 22:48 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Dale Ghent wrote:

> I'm seeing some pretty pitiful performance using ZFS on an NFS server,
> with a ZFS volume exported (only with rw=host.foo.com,root=host.foo.com
> opts) and mounted on a Linux host running kernel 2.4.31. The Linux kernel
> I'm working with is limited in that I can only do NFSv2 mounts...
> regardless of that aspect, I'm sure something's amiss.
>
> [root at hercules]/>mount -o rw,vers=2,hard,intr,rsize=8192,wsize=8192 ds2.rs:/ds-store/rs/test /umbc/test
> [root at hercules]/>time dd if=/dev/zero of=/umbc/test/testfile1 bs=16k count=16384
> 16384+0 records in
> 16384+0 records out
>
> real    170m51.619s
> user    0m0.060s
> sys     0m5.720s
>
> Ooof. 170 minutes to write a 256MB file.

Hmmm, i just tried this on solaris vs. solaris and got it to complete in about a minute (with the server as 32 or 64 bit) over a dirty network - ufs took about 43 seconds:

fsh-mullet# mount -o vers=2 hodur:/z /mnt
fsh-mullet# /bin/time dd if=/dev/zero of=/mnt/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real        1:06.6
user          0.0
sys           1.7
fsh-mullet#
fsh-mullet# mount -o vers=2 hodur:/ufs /mnt
fsh-mullet# /bin/time dd if=/dev/zero of=/mnt/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real         43.1
user          0.0
sys           1.7
fsh-mullet#

The server is a 2way opteron with 3.5G of memory running 3/21 nevada bits. The client is a 2way sparc with 2G of memory running 3/22 nevada bits.

and you have a clean network? what type of machines do you have (and are they running 32 or 64 bit)? was the server doing anything else?

i assume it's reproducible - if so, a snoop trace is always good... plus you can dtrace zfs to see what's slow.

From having worked on NFS, i can say that running 2.4 linux NFS bits is suspect, and lots of improvements have been made in the 2.6 tree. Neil just putback some more improvements to build 37 for ZFS w/ NFS, but it shouldn't matter here... wonder what's different in your environment.

eric
Dale Ghent
2006-Mar-23 23:05 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
I did some more stats gathering against the same NFS server, this time with a Solaris 10u1 box as the NFS client:

NFSv2 mount of ZFS volume, writing 256MB file:
--------------
[root at stats]/>mount -o vers=2 ds2.rs:/ds-store/stats/mail /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
^C2617+0 records in
2617+0 records out

real    18m55.739s
user    0m0.005s
sys     0m0.146s
--------------

(Because my time is short this evening, I had to interrupt that test, but as you can see, only 2617 16k blocks (41,872k!) were written over 19 minutes.)

NFSv3 mount of ZFS volume, writing 256MB file:
--------------
[root at stats]/>mount -o vers=3 ds2.rs:/ds-store/stats/mail /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m27.442s
user    0m0.016s
sys     0m0.933s
--------------

NFSv2 mount of UFS-backed export, writing 256MB file:
--------------
[root at stats]/>mount -o vers=2 ds2.rs:/ds-backup /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m25.909s
user    0m0.027s
sys     0m1.161s
--------------

NFSv3 mount of same UFS-backed export, writing 256MB file:
--------------
[root at stats]/>mount -o vers=3 ds2.rs:/ds-backup /stats
[root at stats]/>time dd if=/dev/zero of=/stats/testfile1 bs=16k count=16384
16384+0 records in
16384+0 records out

real    0m24.916s
user    0m0.023s
sys     0m0.994s
--------------

So with this test, the differences between the NFSv2 and v3 protocols are as expected, but over NFSv2 the gap between ZFS and UFS as the backing store on this Solaris 10 client is pretty much the same thing I saw on the aforementioned Linux client.

/dale
Dale Ghent
2006-Mar-23 23:42 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Here are four snoops, each of ~3000 packets: one set taken against the Linux client and the Solaris 10 client while each mounted a ZFS volume via NFSv2, and the other set of the same two machines writing to the UFS-backed NFS share. These captures start from the first write:

Captures of Linux and Solaris NFS clients writing to a ZFS-backed share:
http://userpages.umbc.edu/~daleg/nfsv2-zfs-linux-client.out.bz2
http://userpages.umbc.edu/~daleg/nfsv2-zfs-solaris10-client.out.bz2

Captures of Linux and Solaris NFS clients writing to a UFS-backed share:
http://userpages.umbc.edu/~daleg/nfsv2-ufs-linux-client.out.bz2
http://userpages.umbc.edu/~daleg/nfsv2-ufs-solaris10-client.out.bz2

HTH
/dale
Dale Ghent
2006-Mar-23 23:51 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
> and you have a clean network? what type of machines do you have (and are
> they running 32 or 64 bit)? was the server doing anything else?

The NFS server in my case is an X4100 running b35 with 4GB of RAM and two 7TB Apple Xserve RAIDs attached over a SAN. One Xserve RAID is RAID5 and UFS; the other is configured as a JBOD and has its 14 disks in one ZFS pool with raidz. The Linux client is an 8x2GHz Xeon box with 24GB of RAM. The Solaris client is another X4100 with 4GB RAM running s10u1.

The network's pretty clean. Both NFS clients are on the same switch as the NFS server.

> i assume it's reproducible - if so, a snoop trace is always good... plus
> you can dtrace zfs to see what's slow.

I'll give that a shot.

> From having worked on NFS, i can say that running 2.4 linux NFS bits is
> suspect, and lots of improvements have been made in the 2.6 tree.

Yeah, I was suspicious of Linux 2.4 as well, but then for grins I ran the same test against the UFS-backed NFS share from the same NFS server, and those results were more in line with my expectations. To further isolate the variables, I tried the same tests from a Solaris host (see my previous posts in this thread) and got the same results... so while I agree that the Linux NFS implementation has never been stellar, I don't really think it's at fault in this case.

/dale
Jeff Bonwick
2006-Mar-24 02:03 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
> Ooof. 170 minutes to write a 256MB file.

Ooof indeed. Just to eliminate a variable, what kind of performance are you getting locally?

BTW, this is a particularly interesting case because you've enabled compression and you're writing a file full of zeroes. When we detect all zeroes, we simply don't allocate a block, so in the local case there should be no disk I/O at all (except a little trickle of mtime updates on the znode for that file). So the all-zeroes time would still be interesting, but I'd also be curious what you get with non-zero data.

When you use NFS, it will periodically fsync() the file, which forces us to write to the intent log. It's possible that lots of synchronous writes to a wide RAID-Z are slowing things down, although even then a meg a minute is hard to explain -- unless, perhaps, these drives are really slow at the SYNCHRONIZE_CACHE SCSI command. You could measure this with dtrace by timing zil_flush_vdevs() -- if it's taking more than a couple of milliseconds (wall-clock time), that's bad.

Finally, a best-practices note: 14 disks is on the wide side for good RAID-Z performance. If you get a chance to reconfigure the pool, I'd suggest keeping it in the single-digit range. In other words, instead of this:

        zpool create ds-store raidz a b c d e f g h i j k l m n

something more like this:

        zpool create ds-store raidz a b c d e f g raidz h i j k l m n

This will give you two seven-wide RAID-Z groups, which should perform a bit better, especially when doing many small I/Os. (This is a matter of percentages, though; it doesn't explain the meg-a-minute behavior you're seeing.)

Jeff
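(A rough sketch of the measurement Jeff describes; the fbt probe module and function names below are assumptions based on his mention of zil_flush_vdevs() and may need checking against `dtrace -l` on this build. Run it on the server while the NFS dd test is in flight; quantize buckets well above a few milliseconds would point at slow cache flushes.)

dtrace -n '
  fbt:zfs:zil_flush_vdevs:entry  { self->ts = timestamp; }
  fbt:zfs:zil_flush_vdevs:return /self->ts/ {
          @["zil_flush_vdevs wall-clock (ns)"] = quantize(timestamp - self->ts);
          self->ts = 0;
  }'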
Bill Sommerfeld
2006-Mar-24 03:00 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
On Thu, 2006-03-23 at 21:03, Jeff Bonwick wrote:

> Finally, a best-practices note: 14 disks is on the wide side for
> good RAID-Z performance. If you get a chance to reconfigure the
> pool, I'd suggest keeping it in the single-digit range. In other
> words, instead of this:
>
>         zpool create ds-store raidz a b c d e f g h i j k l m n
>
> something more like this:
>
>         zpool create ds-store raidz a b c d e f g raidz h i j k l m n

Overly wide raidz groups seem to be an unfenced hole that people new to ZFS fall into on a regular basis.

The man page warns against this but that doesn't seem to be sufficient.

Given that zfs has relatively few such traps, perhaps large raidz groups ought to be implicitly split up absent a "Yes, I want to be stupid" flag.

- Bill
Patrick Bachmann
2006-Mar-24 04:17 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Hey Bill,

Bill Sommerfeld wrote:

> Overly wide raidz groups seem to be an unfenced hole that people new to
> ZFS fall into on a regular basis.
>
> The man page warns against this but that doesn't seem to be sufficient.
>
> Given that zfs has relatively few such traps, perhaps large raidz groups
> ought to be implicitly split up absent a "Yes, I want to be stupid"
> flag.

IMHO it is sufficient to just document this best practice. Neither the "ZFS Administration Guide" nor the ZFS FAQ mentions it. When I try to get to know a new technology and its "traps", I normally check the FAQ first, so appending an answer there would really do the job.

Greetings,

Patrick
Dale Ghent
2006-Mar-24 04:40 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Thanks for the suggestions, Jeff.

Since my last post on this topic, I came home, had a good dinner, and sat back down to look at this issue. It seems that NFSv3 on ZFS is all well and good; it's v2 that's the problem.

So before destroying my 14-disk zpool and remaking it per your suggestions, I ran some more dd tests using /dev/urandom as the source instead of /dev/zero.

Solaris 10u1 client NFSv2 mounting a ZFS-backed share, writing a 256MB file with data sourced from /dev/urandom and then /dev/zero:
-------------
[root at stats]/>time dd if=/dev/urandom of=/stats/testfile1 bs=16k count=16384
0+16384 records in
0+16384 records out

real    4m28.313s
user    0m0.007s
sys     0m0.850s

[root at stats]/>time dd if=/dev/zero of=/stats/testfile2 bs=16k count=16384
^C13877+0 records in
13877+0 records out

real    59m14.522s
user    0m0.025s
sys     0m0.808s
-------------

Okay, so writing a file from /dev/urandom took much less time than writing a file full of nulls. Immediately after quitting that last dd after an hour, I wrote another file from urandom again, just to see what would happen:

-------------
[root at stats]/>time dd if=/dev/urandom of=/stats/testfile3 bs=16k count=16384
0+16384 records in
0+16384 records out

real    5m0.644s
user    0m0.007s
sys     0m0.837s
-------------

Looks just like the first time. So now it looks like NFSv2+ZFS+nulls isn't having it.

I leave tomorrow p.m. for a week overseas, so I doubt I'll have time to remake the zpool as prescribed and re-run the tests, but I definitely will after I return.

/dale
Richard Elling
2006-Mar-24 05:44 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
> Bill Sommerfeld wrote:
> > Overly wide raidz groups seem to be an unfenced hole that people new to
> > ZFS fall into on a regular basis.
> >
> > The man page warns against this but that doesn't seem to be sufficient.
> >
> > Given that zfs has relatively few such traps, perhaps large raidz groups
> > ought to be implicitly split up absent a "Yes, I want to be stupid"
> > flag.
>
> IMHO it is sufficient to just document this best practice.

Disagree. People don't read documentation. It makes good sense to prevent poor configurations, absent an intentional override [*].

However, at this time I don't think we know where the boundaries are. The interplay of performance, RAS, and cost is not obvious. In particular, performance is not fully characterized and the cost drops continuously. More work needed here...

[*] for example, prior to ZFS it was considered a poor practice to mirror onto the same disk. With ZFS, it can make sense in that it protects against certain failure modes which other mirroring technologies do not cover. But by default, we'd probably prefer to avoid mirroring onto the same disk unless the user explicitly desires it.

 -- richard
Patrick Bachmann
2006-Mar-24 07:40 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Hey Richard,

Richard Elling wrote:

> Disagree. People don't read documentation. It makes good
> sense to prevent poor configurations, absent an intentional
> override [*].

Ok, you've got a point there. People really don't read docs. And you're also right that it makes sense to prevent poor configurations.

> However, at this time I don't think we know where the boundaries
> are. The interplay of performance, RAS, and cost is not obvious.
> In particular, performance is not fully characterized and the cost
> drops continuously. More work needed here...

And until these kinds of configurations can and will be handled implicitly, it would be great to have this recommended practice documented at least in the ZFS FAQ. To be honest, I read right over the one subordinate clause saying that the recommended number of devices in a raidz is between three and nine, and there seem to be others like me who get really excited about ZFS but miss that kind of info when skimming the manual. A bullet in the FAQ for this would be really helpful and would guide folks toward getting the best out of ZFS. :)

Greetings,

Patrick
Spencer Shepler
2006-Mar-24 08:21 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
On Thu, Jeff Bonwick wrote:

> > Ooof. 170 minutes to write a 256MB file.
>
> Ooof indeed. Just to eliminate a variable, what kind of performance
> are you getting locally?
>
> BTW, this is a particularly interesting case because you've enabled
> compression and you're writing a file full of zeroes. When we detect
> all zeroes, we simply don't allocate a block, so in the local case
> there should be no disk I/O at all (except a little trickle of mtime
> updates on the znode for that file). So the all-zeroes time would
> still be interesting, but I'd also be curious what you get with
> non-zero data.
>
> When you use NFS, it will periodically fsync() the file, which forces
> us to write to the intent log. It's possible that lots of synchronous

Remember that in the case of NFSv2 (max write size of 8k), each protocol WRITE must be synchronously written on disk. The Solaris server attempts to "collect" a set of write requests such that a larger write/sync is done. So, the server will do a VOP_WRITE followed by a VOP_PUTPAGE for the collected write range.

Spencer
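(To put rough, illustrative numbers on that; the per-commit latencies here are assumptions, not measurements from this thread. A 256MB file over NFSv2 means 256MB / 8KB = 32,768 WRITE RPCs, each of which must reach stable storage before the server replies. At 10 ms per stable commit that is already about 5.5 minutes of pure sync latency; at around 300 ms per commit - say, a very slow SYNCHRONIZE CACHE on the array - it works out to roughly 2.7 hours, which is the neighborhood of the 170-minute ZFS result above.)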
Casper.Dik at Sun.COM
2006-Mar-24 09:01 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
> Thanks for the suggestions, Jeff.
>
> Since my last post on this topic, I came home, had a good dinner, and sat
> back down to look at this issue. It seems that NFSv3 on ZFS is all well and
> good; it's v2 that's the problem.
>
> So before destroying my 14-disk zpool and remaking it per your suggestions,
> I ran some more dd tests using /dev/urandom as the source instead of /dev/zero.
>
> Solaris 10u1 client NFSv2 mounting a ZFS-backed share, writing a 256MB file
> with data sourced from /dev/urandom and then /dev/zero:
> -------------
> [root at stats]/>time dd if=/dev/urandom of=/stats/testfile1 bs=16k count=16384
> 0+16384 records in
> 0+16384 records out

Note: these are incomplete records!! Use obs=16k

Casper
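(A hedged sketch of one way to act on that; the local staging path is an assumption and the NFS path matches the earlier tests. The "0+16384" counts mean every read from /dev/urandom came back short, so less than 256MB was actually written, which skews the comparison with the zero-filled runs.)

# Stage non-zero data locally first; obs=16k reblocks the short reads from
# /dev/urandom into full 16k output records, but count= still limits the
# number of *input* reads, so check the resulting size.
dd if=/dev/urandom of=/var/tmp/random256m ibs=16k obs=16k count=16384
ls -l /var/tmp/random256m    # if short reads shrank it, raise count until it reaches 256MB

# Then time the NFS write against a file of known size and contents:
time dd if=/var/tmp/random256m of=/stats/testfile1 bs=16k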
Robert Milkowski
2006-Mar-24 11:05 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Hello Jeff,

Friday, March 24, 2006, 3:03:09 AM, you wrote:

JB> Finally, a best-practices note: 14 disks is on the wide side for
JB> good RAID-Z performance. If you get a chance to reconfigure the
JB> pool, I'd suggest keeping it in the single-digit range. In other
JB> words, instead of this:

JB>         zpool create ds-store raidz a b c d e f g h i j k l m n

JB> something more like this:

JB>         zpool create ds-store raidz a b c d e f g raidz h i j k l m n

JB> This will give you two seven-wide RAID-Z groups, which should
JB> perform a bit better, especially when doing many small I/Os.
JB> (This is a matter of percentages, though; it doesn't explain
JB> the meg-a-minute behavior you're seeing.)

And why is that? I can see that such a config could give better tolerance of disk failures, but why better performance?

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Bev Crair
2006-Mar-24 15:48 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Folks,

There's been a lot of discussion about elements for an FAQ for ZFS. Sun can certainly provide one. IMHO, though, it would be really great for members of this community to add to the community FAQ already available on OpenSolaris (http://opensolaris.org/os/community/zfs/faq/). Sun could provide links from our BigAdmin website (http://www.sun.com/bigadmin/home/index.html) to the ZFS community on OpenSolaris.

If folks would share the list of questions (beyond the ones that are already there), and even answers, we'll get things pulled together and added to the FAQ.

Thanks,
Bev Crair
Director, Solaris KISS

Patrick Bachmann wrote:

> A bullet in the FAQ for this would be really helpful and would guide
> folks toward getting the best out of ZFS. :)
eric kustarz
2006-Mar-24 17:41 UTC
[zfs-discuss] Re: Poor performance on NFS-exported ZFS volumes
Dale Ghent wrote:

> Here are four snoops, each of ~3000 packets: one set taken against the Linux
> client and the Solaris 10 client while each mounted a ZFS volume via NFSv2,
> and the other set of the same two machines writing to the UFS-backed NFS
> share. These captures start from the first write:
>
> Captures of Linux and Solaris NFS clients writing to a ZFS-backed share:
> http://userpages.umbc.edu/~daleg/nfsv2-zfs-linux-client.out.bz2
> http://userpages.umbc.edu/~daleg/nfsv2-zfs-solaris10-client.out.bz2
>
> Captures of Linux and Solaris NFS clients writing to a UFS-backed share:
> http://userpages.umbc.edu/~daleg/nfsv2-ufs-linux-client.out.bz2
> http://userpages.umbc.edu/~daleg/nfsv2-ufs-solaris10-client.out.bz2

So, somewhat expected (after seeing your results): with zfs you'll see delays in the write replies:

winter1% snoop -i nfsv2-zfs-solaris10-client.out -p2915,2925
2915   0.00000  130.85.24.10 -> 130.85.24.9   TCP D=1000 S=2049 Ack=859052744 Seq=3064684420 Len=0 Win=48180
2916   0.00018  130.85.24.9 -> 130.85.24.10   TCP D=2049 S=1000 Ack=3064684420 Seq=859052744 Len=1460 Win=49640
2917   0.00000  130.85.24.9 -> 130.85.24.10   TCP D=2049 S=1000 Push Ack=3064684420 Seq=859054204 Len=1068 Win=49640
2918   0.00004  130.85.24.10 -> 130.85.24.9   TCP D=1000 S=2049 Ack=859055272 Seq=3064684420 Len=0 Win=49640
2919   2.30632  130.85.24.10 -> 130.85.24.9   RPC R XID=4156007980 Success
2920   0.00003  130.85.24.10 -> 130.85.24.9   RPC R XID=4172785196 Success
2921   0.00000  130.85.24.10 -> 130.85.24.9   RPC R XID=4206339628 Success
2922   0.00000  130.85.24.10 -> 130.85.24.9   RPC R XID=4189562412 Success

Notice that the reply in packet 2919 came over 2 seconds later. You'll see the same thing if the client is linux (using UDP) or solaris (using TCP). With ufs, there is no long delay.

i can't reproduce this running snv_35, compression on, and using a 4 disk raidz - but something is amiss in your setup. Your storage setup looks ok - i guess it would be nice if you could switch the Xserve RAIDs to see if that's the problem, but that's probably not possible.

dtracing to see how long zfs_write() / zil_commit() / zil_flush_vdevs() / ldi_strategy() are taking would be good.

eric
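(A minimal sketch of that tracing; the probe module names are assumptions - zfs for the filesystem/ZIL functions, genunix for ldi_strategy() - so verify them with `dtrace -l | grep <function>` before relying on the output. It quantizes wall-clock time per call so whichever stage is eating the seconds stands out.)

dtrace -n '
  fbt:zfs:zfs_write:entry, fbt:zfs:zil_commit:entry,
  fbt:zfs:zil_flush_vdevs:entry, fbt:genunix:ldi_strategy:entry
  {
          self->ts[probefunc] = timestamp;
  }

  fbt:zfs:zfs_write:return, fbt:zfs:zil_commit:return,
  fbt:zfs:zil_flush_vdevs:return, fbt:genunix:ldi_strategy:return
  /self->ts[probefunc]/
  {
          @[probefunc] = quantize(timestamp - self->ts[probefunc]);
          self->ts[probefunc] = 0;
  }'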
Cindy Swearingen
2006-Mar-25 00:30 UTC
[zfs-discuss] Poor performance on NFS-exported ZFS volumes
Patrick,

Thanks for the doc feedback. I've drafted some information about the raidz performance issue in the current ZFS admin guide. We'll continue to collect best-practices information over the next month or so for the docs, FAQs, etc.

Jeff, the ZFS team, and this forum will have much to contribute, but we're still in the gathering stages. :-)

Cindy

Patrick Bachmann wrote:

> IMHO it is sufficient to just document this best practice. Neither the
> "ZFS Administration Guide" nor the ZFS FAQ mentions it. When I try to get
> to know a new technology and its "traps", I normally check the FAQ first,
> so appending an answer there would really do the job.