Hello all,

After setting up a Solaris 10 machine with ZFS as the new NFS server, I'm stumped by some serious performance problems. Here are the (admittedly long) details (also noted at http://www.netmeister.org/blog/):

The machine in question is a dual-amd64 box with 2GB RAM and two broadcom gigabit NICs. The OS is Solaris 10 6/06 and the filesystem consists of a single zpool stripe across the two halves of an Apple XRaid (each half configured as RAID5), providing a pool of 5.4 TB. On the pool, I've created a total of 60 filesystems, each of them shared via NFS, each of them with compression turned on. The clients (NetBSD) mount the filesystems with '-U -r32768 -w32768', and initially everything looks just fine. (The clients also do NIS against a different server.)

Running 'ls' and 'ls -l' on a large directory looks just fine, upon first inspection. Reading the filesystem works fine, too:

(1) Running a 'find . -type f -print' on a directory with a total of 46450 files/subdirectories in it takes about 90 seconds, yielding an average I/O size of 64 at around 2000 kr/s according to iostat(1M).

(2) Running a 'dd if=/dev/zero of=blah bs=1024k count=128' takes about 18 seconds at almost 7MB/s (this is a 10/100 network). To compare how this measures up when not doing any file I/O, I ran 'dd if=/dev/zero bs=1024k count=128 | ssh remotehost "cat - >/dev/null"', which took about 13 seconds.

(3) Reading from the NFS share ('dd if=blah of=/dev/null bs=1024k') takes about 12 seconds.

All of this is perfectly acceptable. Compared with the old NFS server (which runs on IRIX), we get:

(1) takes significantly longer on IRIX: about 300 seconds
(2) is somewhat faster on IRIX: it takes about 14 seconds
(3) takes about the same (around 12 seconds)

(The comparison is not entirely fair, however: the NFS share mounted from the IRIX machine is also exported to about 90 other clients and sees its fair share of I/O, while the Solaris NFS share is only mounted on this one client.)

Alright, so what's my beef? Well, here's the fun part: when I try to actually use this NFS share as my home directory (as I do with the IRIX NFS mount), performance plummets. Reading my inbox (~/.mail) takes around 20 seconds (even though it has only 60 messages in it). When I try to run 'ktrace -i mutt' with the ktrace output going to the NFS share, everything crawls to a halt. While that command is running, even a simple 'ls -la' of a small directory takes almost 5 seconds. Neither the ktrace nor the mutt command can be killed right away -- they're blocking on I/O.

Alright, so after that finally finished, I try something a bit simpler: 'vi plaintext'. Ok, that's as snappy as it should be. Now: 'vim plaintext'. Ugh, that took almost 4 seconds for the editor to come up.

There are all kinds of other examples that I tried, but the one standing out the most was trying to create a number of directories:

    for i in `jot 100`; do
        mkdir $i
        for j in `jot 100`; do
            mkdir $i/$j
        done
    done

On the IRIX NFS share, this takes about 60 seconds. On the Solaris NFS share, this takes... forever. (I interrupted it after 10 minutes, when it had managed to create 2500 directories.)

tcpdump and snoop show me that traffic zips by as it should for the operations described above ((1), (2) and (3)), but becomes very "bursty" when doing reads and writes simultaneously or when creating the directories.
That is, instead of a constant stream of packets zipping by, tcpdump gives me about 15 lines every second, yet I can't find any packet loss.

I've tried to see if this is a problem with ZFS itself: I ran the same tests on the file server directly on the ZFS, and everything seems to work just fine there. I've tried to mount the filesystem over TCP and with different read/write sizes, with NFSv2 and NFSv3 (the clients don't support NFSv4). I've tried to see if it's the NIC or the network by testing regular network speeds and connecting the machine to a different switch etc., all to no avail. I've played with every setting in /etc/default/nfs to no avail, and I just can't put my finger on it.

Alright, so in my next attempt to see if I'm crazy or not, I installed Solaris 6/06 on another workstation. From there, mounting a ZFS works just dandy; all the above tests are fast. So I reinstall the other machine. After importing the old zpool, nothing has changed. I destroy the zpool and recreate it. Still the same problem. To ensure that it's not the SAN switch, I connect the Solaris machine directly to the XRaid, and again, no change. I destroy the RAID-5 config on the XRaid and build a RAID-0 across 7 of the disks. Creating a zpool on only this one (striped) disk also does not change performance at all. Creating a regular UFS on this disk, however, immediately fixes the problems!

So it's not the fibre channel switch, it's not the fibre channel cables, it's not the fibre channel card, it's not the gigabit card, it's not the machine, it's not the mount options -- it simply appears to be ZFS. ZFS on an Apple XRaid, to be precise. (Maybe it's ZFS on fibre channel, I don't know; it's not ZFS per se, as the other freshly installed machine with ZFS on a local SATA disk worked fine.)

Still trying to figure out what exactly is going on here, I then took somebody else's advice and tried to see if maybe there is a relation between the size of the zpool and the NFS performance. Connecting the machine in question to a different XRaid with a 745 GB RAID-5 disk, I tried to create a single zpool on that disk. Again, the same performance problems as noted earlier. Then I partitioned the disk into a 100 GB partition and tried to create a zpool on that. Again, no luck. Performance still stinks.

FWIW, format reports the XRaid disks as:

    2. c3t0d0 <APPLE-Xserve RAID-1.50-2.73TB>
       /pci@0,0/pci1022,7450@b/pci1000,1010@1,1/sd@0,0
    3. c3t1d0 <APPLE-Xserve RAID-1.50-2.73TB>
       /pci@0,0/pci1022,7450@b/pci1000,1010@1,1/sd@1,0
    4. c3t2d0 <APPLE-Xserve RAID-1.26-745.21GB>
       /pci@0,0/pci1022,7450@b/pci1000,1010@1,1/sd@2,0

Is there anybody here who's using ZFS on Apple XRaids and serving them via NFS? Does anybody have any other ideas what I could do to solve this? (I have, in the meantime, converted the XRaid to plain old UFS, and performance is perfectly fine here, but I'd still be interested in what exactly is going on.)

-Jan

--
http://www.eff.org
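For reference, a minimal sketch of the server-side setup described above; the pool name "tank", the dataset name "home", and the server hostname are illustrative, and the real configuration creates 60 such filesystems:

    # stripe a pool across the two XRaid halves (redundancy comes from the
    # XRaid's own RAID5, not from ZFS)
    zpool create tank c3t0d0 c3t1d0

    # one of the NFS-shared, compressed filesystems
    zfs create tank/home
    zfs set compression=on tank/home
    zfs set sharenfs=on tank/home

On the NetBSD clients, the quoted mount options amount to something like:

    mount_nfs -U -r32768 -w32768 server:/tank/home /home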
On Mon, Jul 31, 2006 at 02:17:00PM -0400, Jan Schaumann wrote:
> Is there anybody here who's using ZFS on Apple XRaids and serving them
> via NFS?  Does anybody have any other ideas what I could do to solve
> this?  (I have, in the meantime, converted the XRaid to plain old UFS,
> and performance is perfectly fine here, but I'd still be interested in
> what exactly is going on.)

One of the main differences in how UFS and ZFS treat a disk is that ZFS will enable the write cache and send SYNCHRONIZE_CACHE commands down to the disk (at appropriate points) to ensure that data hits stable storage. Also, NFS tends to force this operation quite often. My guess would be that the Apple XRaid responds "poorly" to these commands, leading to the clumps of hair that are probably on your floor. :)

To test this theory, run this command on your NFS server (as root):

    echo '::spa -v' | mdb -k | \
        awk '/dev.dsk/{print $1"::print -a vdev_t vdev_nowritecache"}' | \
        mdb -k | awk '{print $1"/W1"}' | mdb -kw

This will turn off write cache flushing on all devices in all pools. Note that this is a very dangerous state to run in if your disk/RAID has caching enabled without NVRAM. To put things back to normal, type the above command, replacing the W1 in the last line with W0.

If this helps, we can start thinking about options for fixing it.

--Bill
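A quick way to check whether the flag actually flipped is to reuse the first half of the same pipeline without the write step; this is only a sketch, and (as the following messages show) it assumes the CTF debugging data on the box is intact:

    echo '::spa -v' | mdb -k | \
        awk '/dev.dsk/{print $1"::print -a vdev_t vdev_nowritecache"}' | \
        mdb -k

Each output line should name vdev_nowritecache and show 0 (B_FALSE) or 1 (B_TRUE).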
Bill Moore <Bill.Moore@sun.com> wrote:
> To test this theory, run this command on your NFS server (as root):
>
>     echo '::spa -v' | mdb -k | \
>         awk '/dev.dsk/{print $1"::print -a vdev_t vdev_nowritecache"}' | \
>         mdb -k | awk '{print $1"/W1"}' | mdb -kw

Thanks for the suggestion. However, I'm not sure if the above pipeline is correct:

    1# echo '::spa -v' | mdb -k
    ADDR                 STATE     NAME
    ffffffff85dfa000     ACTIVE    tank

        ADDR             STATE     AUX          DESCRIPTION
        ffffffff857a0ac0 HEALTHY   -            root
        ffffffff857a0580 HEALTHY   -            /dev/dsk/c3t2d0s0

    2# !! | awk '/dev.dsk/{print $1"::print -a vdev_t vdev_nowritecache"}'
    ffffffff857a0580::print -a vdev_t vdev_nowritecache
    3# !! | mdb -k
    0
    4# !! | awk '{print $1"/W1"}'
    0/W1
    5# !! | mdb -kw
    mdb: failed to write 1 at address 0x0: no mapping for address
    6#

Since I'm not very familiar with Solaris at all, I'm not sure what exactly is meant to come out of "#3", but I suspect it's not supposed to be just "0".

-Jan

--
The reader is encouraged to add smileys where necessary to increase positive perception.  Right here might be a good place:
On Mon, Jul 31, 2006 at 03:59:23PM -0400, Jan Schaumann wrote:
> Thanks for the suggestion.  However, I'm not sure if the above pipeline
> is correct:
>
> 2# !! | awk '/dev.dsk/{print $1"::print -a vdev_t vdev_nowritecache"}'
> ffffffff857a0580::print -a vdev_t vdev_nowritecache
> 3# !! | mdb -k
> 0

Hmm. It should have printed something like this:

    ffffffff857a0a60 vdev_nowritecache = 0 (B_FALSE)

I think there might be a problem with the CTF data (debugging info) in U2. First, check /etc/release and make sure it says something like "Solaris 10 6/06 s10x_u2wos_09a X86" in the first line. Then run this command:

    echo '::offsetof vdev_t vdev_nowritecache' | mdb -k

And make sure it prints "4e0" as the answer. If that's the case, then run this:

    echo '::spa -v' | mdb -k | awk '/dev.dsk/{print $1"+4e0/W1"}' | mdb -kw

And to get back, replace W1 with W0 as before.

--Bill
Bill Moore <Bill.Moore@sun.com> wrote:
> Hmm. It should have printed something like this:
>
>     ffffffff857a0a60 vdev_nowritecache = 0 (B_FALSE)
>
> I think there might be a problem with the CTF data (debugging info)
> in U2. First, check /etc/release and make sure it says something like
> "Solaris 10 6/06 s10x_u2wos_09a X86" in the first line.

So far, so good.

> Then run this command:
>
>     echo '::offsetof vdev_t vdev_nowritecache' | mdb -k
>
> And make sure it prints "4e0" as the answer.

    # echo '::offsetof vdev_t vdev_nowritecache' | mdb -k
    offsetof (vdev_t, vdev_nowritecache) = 0x4c0
    #

Somebody else pointed out patch 122641-06-1, which might be relevant here, but since I don't have a Sun Service Plan, I don't have access to it. :-/

-Jan

--
chown -R us:enemy your_base
On Mon, Jul 31, 2006 at 06:08:04PM -0400, Jan Schaumann wrote:
> # echo '::offsetof vdev_t vdev_nowritecache' | mdb -k
> offsetof (vdev_t, vdev_nowritecache) = 0x4c0

Ok, then try this:

    echo '::spa -v' | mdb -k | awk '/dev.dsk/{print $1"+4c0/W1"}' | mdb -kw

--Bill
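Spelled out for this offset, the corresponding revert (turning write cache flushing back on) would presumably be the same pipeline with W0 in place of W1:

    echo '::spa -v' | mdb -k | awk '/dev.dsk/{print $1"+4c0/W0"}' | mdb -kw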
On Jul 31, 2006, at 2:17 PM, Jan Schaumann wrote:
> Hello all,
>
> After setting up a Solaris 10 machine with ZFS as the new NFS server,
> I'm stumped by some serious performance problems.  Here are the
> (admittedly long) details (also noted at
> http://www.netmeister.org/blog/):
>
> The machine in question is a dual-amd64 box with 2GB RAM and two
> broadcom gigabit NICs.  The OS is Solaris 10 6/06 and the filesystem
> consists of a single zpool stripe across the two halves of an Apple
> XRaid (each half configured as RAID5), providing a pool of 5.4 TB.

Hello Jan,

I'm in a very similar situation. I have two Xserve RAIDs; half of the disks in each are a RAID5'd LUN, and those are presented to a Sun X4100 and mirrored in ZFS there. This server is an NFS server for research-related storage and has multiple NFS clients running either Linux 2.4.x or IRIX 6.5.x. These clients and their NFS server have dedicated gig-e between them, and performance is not exactly stellar.

For instance, here is an example of the 2nd command you gave when run on a Linux NFS client:

====================
[root@linux]/mnt$ time dd if=/dev/zero of=blah bs=1024k count=128
128+0 records in
128+0 records out

real    1m21.343s
user    0m0.000s
sys     0m2.480s
====================

1m 21sec to write a 128MB file over a NFSv3 mount over a gig-e network. The mount options for this Linux client are: nfsvers=3,rsize=32768,wsize=32768. No matter what rsize and wsize I set, the time trial results are always in the vicinity of 1m20s.

Mounting with NFSv2 and running the same test is even worse. No, it's horrendous and scary:

====================
[root@linux]/mnt$ time dd if=/dev/zero of=blah5 bs=1024k count=128
128+0 records in
128+0 records out

real    36m5.642s
user    0m0.000s
sys     0m2.370s
====================

If I try that test on the NFS server itself, in the same volume I NFS mounted on the above Linux client, I get decent speed:

====================
[root@ds2]/ds2-store/test/smbshare$ time dd if=/dev/zero of=blah2 bs=1024k count=128
128+0 records in
128+0 records out

real    0m0.214s
user    0m0.001s
sys     0m0.212s
====================

So it seems ZFS itself is OK on top of these Apple Xserve RAIDs (which are running the 1.5 firmware). ZFS volume compression is turned on.

I replicated your #3 command and the Linux NFS client read the file back in 2.4 seconds (this was after a umount and remount of the NFS share). So while reads from a NFS client seem ok (still, not great though), and writing to the ZFS volume on the NFS server is also OK, writing over NFS over the dedicated gig-e network is painfully slow. I, too, see bursty traffic with the NFS writes.

I put a Solaris 10 host on this gig-e NFS-only network, the same one that the Linux client is on, and I mounted the same NFS share with the equivalent options and got far, far better results:

====================
[root@solaris]/$ mount -o vers=3 ds2.rs:/ds2-store/test/smbshare /mnt
[root@solaris]/$ cd /mnt
[root@solaris]/mnt$ time dd if=/dev/zero of=blah bs=1024k count=128
128+0 records in
128+0 records out

real    0m13.349s
user    0m0.001s
sys     0m0.519s
====================

... And a read of that file after a unmount/remount (to clear any local cache):

====================
[root@solaris]/$ umount /mnt
[root@solaris]/$ mount -o vers=3 ds2.rs:/ds2-store/test/smbshare /mnt
[root@solaris]/$ cd /mnt
[root@solaris]/mnt$ time dd if=blah of=/dev/null bs=1024k
128+0 records in
128+0 records out

real    0m11.481s
user    0m0.001s
sys     0m0.295s
====================

Hmm.
It took nearly as long to read the file as it did to write it. Without a remount, the file reads back in 0.24 seconds (to be expected, of course).

So what does this exercise leave me thinking? Is Linux 2.4.x really screwed up in NFS-land? This Solaris NFS server replaces a Linux-based NFS server that the clients (Linux and IRIX) liked just fine.

/dale
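One way to put numbers on the bursty NFS write behaviour described above is nfsstat; the output format differs a bit between client OSes, but roughly:

    # on the NFS client (Linux or Solaris): per-operation counts, including
    # the NFSv3 "write" and "commit" counters -- snapshot before and after a test
    nfsstat -c

    # on the Solaris NFS server: the matching server-side operation counts
    nfsstat -s

Roughly speaking, a write test whose commit count tracks its write count closely is being forced to stable storage on almost every transfer.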
On Mon, 31 Jul 2006, Dale Ghent wrote:
> So what does this exercise leave me thinking? Is Linux 2.4.x really screwed up
> in NFS-land? This Solaris NFS server replaces a Linux-based NFS server that the

Linux has had, uhhmmm (struggling to be nice), iffy NFS for ages.

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
On Jul 31, 2006, at 7:30 PM, Rich Teer wrote:
> On Mon, 31 Jul 2006, Dale Ghent wrote:
>
>> So what does this exercise leave me thinking? Is Linux 2.4.x
>> really screwed up in NFS-land? This Solaris NFS server replaces a
>> Linux-based NFS server that the
>
> Linux has had, uhhmmm (struggling to be nice), iffy NFS for ages.

Right, but I never had this speed problem when the NFS server was running Linux on hardware that had a quarter of the CPU power and half the disk I/O capacity that the new Solaris-based one has. So either Linux's NFS client was more compatible with the bugs in Linux's NFS server and ran peachy that way, or something's truly messed up with how Solaris's NFS server handles Linux NFS clients.

Mind you, all the tests I did in my previous posts were on shares served out of ZFS. I just lopped a fresh LUN off another Xserve RAID on my SAN, gave it to the NFS server and put UFS on it. Let's see if there's a difference when mounting that on the clients:

Linux NFS client mounting UFS-backed share:
====================
[root@linux]/$ mount -o nfsvers=3,rsize=32768,wsize=32768 ds2-private:/ufsfoo /mnt
[root@linux]/$ cd /mnt
[root@linux]/mnt$ time dd if=/dev/zero of=blah bs=1024k count=128
128+0 records in
128+0 records out

real    0m9.267s
user    0m0.000s
sys     0m2.480s
====================

Hey! Look at that! 9.2 seconds in this test. The same test with the ZFS-backed share (see previous email in this thread) took 1m 21s to complete. Remember this same test that I did with a NFSv2 mount, which took 36 minutes to complete on the ZFS-backed share? Let's try that here with the UFS-backed share:

====================
[root@linux]/$ mount -o nfsvers=2,rsize=32768,wsize=32768 ds2-private:/ufsfoo /mnt
[root@linux]/$ cd /mnt
[root@linux]/mnt$ time dd if=/dev/zero of=blah2 bs=1024k count=128
128+0 records in
128+0 records out

real    0m3.103s
user    0m0.000s
sys     0m2.880s
====================

Three seconds vs. 36 minutes. Methinks that there's something fishy here, regardless of Linux's reputation in the NFS world.

Don't get me wrong. I love Solaris like I love taffy (and BOY do I love taffy), but there seems to be some really wonky Linux<->NFS<->Solaris<->ZFS interaction going on that's really killing performance, and my finger so far points at Solaris. :/

/dale
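For completeness, a sketch of how such a UFS comparison share might be set up on the Solaris side; the device name and mount point are hypothetical:

    # put UFS on a spare LUN and share it over NFS
    newfs /dev/rdsk/c3t3d0s0        # answer 'y' at the prompt
    mkdir /ufsfoo
    mount /dev/dsk/c3t3d0s0 /ufsfoo
    share -F nfs -o rw /ufsfoo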
Rich Teer wrote:
> On Mon, 31 Jul 2006, Dale Ghent wrote:
>
>> So what does this exercise leave me thinking? Is Linux 2.4.x really screwed up
>> in NFS-land? This Solaris NFS server replaces a Linux-based NFS server that the
>
> Linux has had, uhhmmm (struggling to be nice), iffy NFS for ages.

The 2.6.x Linux client is much nicer... one thing fixed was the client doing too many commits (which translates to fsyncs on the server). I would still recommend the Solaris client, but I'm sure that's no surprise. But if you're stuck on Linux, upgrade to the latest stable 2.6.x and I'd be curious if it was better.

eric
On Jul 31, 2006, at 8:07 PM, eric kustarz wrote:
>
> The 2.6.x Linux client is much nicer... one thing fixed was the
> client doing too many commits (which translates to fsyncs on the
> server). I would still recommend the Solaris client, but I'm sure
> that's no surprise. But if you're stuck on Linux, upgrade to the
> latest stable 2.6.x and I'd be curious if it was better.

I'd love to be on kernel 2.6, but due to the philosophical stance towards OpenAFS of some people on the lkml list[1], moving to 2.6 is a tough call for us to make. But that's another story for another list. The fact is that I'm stuck on 2.4 for the time being, and I'm having problems with a Solaris/ZFS NFS server that Jan and I are not having with Solaris/UFS and (in my case) Linux/XFS NFS servers.

[1] https://lists.openafs.org/pipermail/openafs-devel/2006-July/014041.html

/dale
On 7/31/06, Dale Ghent <daleg@elemental.org> wrote:
> On Jul 31, 2006, at 8:07 PM, eric kustarz wrote:
>
>> The 2.6.x Linux client is much nicer... one thing fixed was the
>> client doing too many commits (which translates to fsyncs on the
>> server). I would still recommend the Solaris client, but I'm sure
>> that's no surprise. But if you're stuck on Linux, upgrade to the
>> latest stable 2.6.x and I'd be curious if it was better.
>
> I'd love to be on kernel 2.6, but due to the philosophical stance
> towards OpenAFS of some people on the lkml list[1], moving to 2.6 is
> a tough call for us to make. But that's another story for another list.
> The fact is that I'm stuck on 2.4 for the time being, and I'm having
> problems with a Solaris/ZFS NFS server that Jan and I are not
> having with Solaris/UFS and (in my case) Linux/XFS NFS servers.
>
> [1] https://lists.openafs.org/pipermail/openafs-devel/2006-July/014041.html
>
> /dale

First, OpenAFS 1.4 works just fine with 2.6-based kernels. We've already standardized on that over 2.4 kernels (deprecated) at Stanford.

Second, I had similar fsync fatality when it came to NFS clients (Linux or Solaris, mind you) and non-locally-backed storage using ZFS on a Solaris 10U2 (or B40+) server. My case was iSCSI, and it was chalked up to latency on iSCSI, but I still to this day find NFS write performance on small files, or multitudes of files at a time, with ZFS as a back end to be rather iffy. It's perfectly fast for NFS reads, and it's always speedy local to the box, but the NFS/ZFS integration seems problematic. I can always test with UFS and get great performance. It's the round trips with many fsyncs to the backend storage that ZFS requires for commits that get you.
> So what does this exercise leave me thinking? Is Linux 2.4.x really
> screwed up in NFS-land? This Solaris NFS server replaces a Linux-based NFS
> server that the clients (Linux and IRIX) liked just fine.

Yes; the Linux NFS server and client work together just fine, but generally only because the Linux NFS server replies that writes are done before they are committed to disk (async operation).

The Linux NFS client is not optimized for servers which do not do this, and it appears to write little before waiting for the commit replies.

Casper
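For comparison, this is the server-side behaviour Casper describes, expressed as a Linux /etc/exports entry; the path and hostname are illustrative, and async trades crash safety for speed (acknowledged writes can be lost if the server goes down):

    # /etc/exports on a Linux NFS server
    # reply only after data is on stable storage:
    /export/home    client.example.com(rw,sync)
    # reply as soon as the data is in server memory (what is described above):
    # /export/home  client.example.com(rw,async)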
> Right, but I never had this speed problem when the NFS server was
> running Linux on hardware that had a quarter of the CPU power and
> half the disk I/O capacity that the new Solaris-based one has.
>
> So either Linux's NFS client was more compatible with the bugs in
> Linux's NFS server and ran peachy that way, or something's truly
> messed up with how Solaris's NFS server handles Linux NFS clients.

Yes; in fact, I think it is well known that specifically the 2.4 implementation of the Linux NFS client and server cut corners, which made the NFS client perform well with the Linux NFS server but not others.

> Mind you, all the tests I did in my previous posts were on shares
> served out of ZFS. I just lopped a fresh LUN off another Xserve RAID
> on my SAN, gave it to the NFS server and put UFS on it. Let's see if
> there's a difference when mounting that on the clients:
>
> Linux NFS client mounting UFS-backed share:
> ====================
> [root@linux]/$ mount -o nfsvers=3,rsize=32768,wsize=32768 ds2-private:/ufsfoo /mnt
> [root@linux]/$ cd /mnt
> [root@linux]/mnt$ time dd if=/dev/zero of=blah bs=1024k count=128
> 128+0 records in
> 128+0 records out
>
> real    0m9.267s
> user    0m0.000s
> sys     0m2.480s
> ====================
>
> Hey! Look at that! 9.2 seconds in this test. The same test with the
> ZFS-backed share (see previous email in this thread) took 1m 21s to
> complete. Remember this same test that I did with a NFSv2 mount,
> which took 36 minutes to complete on the ZFS-backed share? Let's try
> that here with the UFS-backed share:

Now, this is *very* interesting.

> ====================
> [root@linux]/$ mount -o nfsvers=2,rsize=32768,wsize=32768 ds2-private:/ufsfoo /mnt
> [root@linux]/$ cd /mnt
> [root@linux]/mnt$ time dd if=/dev/zero of=blah2 bs=1024k count=128
> 128+0 records in
> 128+0 records out
>
> real    0m3.103s
> user    0m0.000s
> sys     0m2.880s
> ====================
>
> Three seconds vs. 36 minutes.
>
> Methinks that there's something fishy here, regardless of Linux's
> reputation in the NFS world.

Again, this may well point to the time between write and write reply, and the fact that the Linux NFS server usually replies before the write is done while the Linux NFS client only allows for very few outstanding writes.

Casper
Bill Moore <Bill.Moore@sun.com> wrote:
> On Mon, Jul 31, 2006 at 06:08:04PM -0400, Jan Schaumann wrote:
> > # echo '::offsetof vdev_t vdev_nowritecache' | mdb -k
> > offsetof (vdev_t, vdev_nowritecache) = 0x4c0
>
> Ok, then try this:
>
>     echo '::spa -v' | mdb -k | awk '/dev.dsk/{print $1"+4c0/W1"}' | mdb -kw

Alright, this yields

    # echo '::spa -v' | mdb -k | awk '/dev.dsk/{print $1"+4c0/W1"}' | mdb -kw
    0xffffffff857a0a40:             0               =       0x1
    #

However, mounting the disk on the remote host does not show any change in behaviour whatsoever.

-Jan

--
As we all know, reality is a mess.
        Larry Wall
On Aug 1, 2006, at 03:43, Casper.Dik@Sun.COM wrote:
>
>> So what does this exercise leave me thinking? Is Linux 2.4.x really
>> screwed up in NFS-land? This Solaris NFS server replaces a Linux-based NFS
>> server that the clients (Linux and IRIX) liked just fine.
>
> Yes; the Linux NFS server and client work together just fine, but
> generally only because the Linux NFS server replies that writes are
> done before they are committed to disk (async operation).
>
> The Linux NFS client is not optimized for servers which do not do this,
> and it appears to write little before waiting for the commit replies.

Well .. linux clients with linux servers tend to be slightly better behaved since the server essentially fudges on the commit and the async cluster count is generally higher (it won't switch on every operation like Solaris will by default.)

Additionally, there's a VM issue in the page-writeback code that seems to affect write performance and RPC socket performance when there's a high dirty page count. Essentially, as pages are flushed there's a higher number of NFS commit operations, which will tend to slow down the Solaris NFS server (and probably the txgs or zil as well with the increase in synchronous behaviour.) On the linux 2.6 VM, the number of commits has been seen to rise dramatically when the dirty page count is between 40-90% of the overall system memory .. by tuning the dirty page ratio back down to 10% there's typically less time spent in page-writeback and the overall async throughput should rise .. this wasn't really addressed until 2.6.15 or 2.6.16, so you might also get better results on a later kernel.

Watching performance between a linux client and a linux server, the linux server seems to buffer the NFS commit operations .. of course the clients will also buffer as much as they can, so you can end up with some unbelievable performance numbers both on the filesystem layers (before you do a sync) and on the NFS client layers as well (until you unmount/remount.)

Overall, I find that the Linux VM suffers from many of the same sorts of large-memory performance problems that Solaris used to face before priority paging in 2.6 and subsequent page coloring schemes. Based on my unscientific mac powerbook performance observations, I suspect that there could be similar issues with various iterations of the BSD or Darwin kernels, but I haven't taken the initiative to really study any of this.

So to wrap up: when doing linux client / solaris server NFS, I'll typically tune the client for 32KB async tcp transfers (you have to dig into the kernel source to increase this and it's not really worth it), tune the VM to reduce time spent in the kludgy page-writeback (typically a sysctl setting for the dirty page ratio or some such), and then increase the nfs:nfs3_async_clusters and nfs:nfs4_async_clusters to something higher than 1 .. say 32 x 32KB transfers to get you to 1MB .. you can also increase the number of threads and the read-ahead on the server to eke out some more performance.

I'd also look at tuning the volblocksize and recordsize as well as the stripe width on your array to 32K or reasonable multiples .. but I'm not sure how much of the issue is in misaligned I/O block sizes between the various elements vs mandatory pauses or improper behaviour incurred from miscommunication ..

---
.je
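A concrete sketch of the tunables mentioned above; the values are illustrative starting points rather than recommendations, and the dataset name is hypothetical:

    # Linux 2.6 client: start page-writeback earlier (smaller dirty-page window)
    sysctl -w vm.dirty_ratio=10

    # Solaris client, in /etc/system (reboot required): larger async clusters
    set nfs:nfs3_async_clusters = 32
    set nfs:nfs4_async_clusters = 32

    # Solaris server, in /etc/default/nfs: more nfsd threads
    NFSD_SERVERS=256

    # ZFS dataset: match the recordsize to the 32K transfer size
    zfs set recordsize=32K tank/export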
Joe Little wrote:
> On 7/31/06, Dale Ghent <daleg@elemental.org> wrote:
>
>> On Jul 31, 2006, at 8:07 PM, eric kustarz wrote:
>>
>>> The 2.6.x Linux client is much nicer... one thing fixed was the
>>> client doing too many commits (which translates to fsyncs on the
>>> server). I would still recommend the Solaris client, but I'm sure
>>> that's no surprise. But if you're stuck on Linux, upgrade to the
>>> latest stable 2.6.x and I'd be curious if it was better.
>>
>> I'd love to be on kernel 2.6, but due to the philosophical stance
>> towards OpenAFS of some people on the lkml list[1], moving to 2.6 is
>> a tough call for us to make. But that's another story for another list.
>> The fact is that I'm stuck on 2.4 for the time being, and I'm having
>> problems with a Solaris/ZFS NFS server that Jan and I are not
>> having with Solaris/UFS and (in my case) Linux/XFS NFS servers.
>>
>> [1] https://lists.openafs.org/pipermail/openafs-devel/2006-July/014041.html
>>
>> /dale
>
> First, OpenAFS 1.4 works just fine with 2.6-based kernels. We've
> already standardized on that over 2.4 kernels (deprecated) at
> Stanford.
>
> Second, I had similar fsync fatality when it came to NFS clients
> (Linux or Solaris, mind you) and non-locally-backed storage using
> ZFS on a Solaris 10U2 (or B40+) server. My case was iSCSI, and it was
> chalked up to latency on iSCSI, but I still to this day find NFS
> write performance on small files, or multitudes of files at a time, with ZFS
> as a back end to be rather iffy. It's perfectly fast for NFS reads, and
> it's always speedy local to the box, but the NFS/ZFS integration
> seems problematic. I can always test with UFS and get great performance.
> It's the round trips with many fsyncs to the backend storage that ZFS
> requires for commits that get you.

Do you have a reproducible test case for this? If so, I would be interested...

I wonder if you're hitting:

    6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS
    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510

which Neil is finishing up as we type.

The problem basically is that fsyncs can get slowed down by non-related I/O, so if you had a process/NFS client that was doing lots of I/O and another doing fsyncs, the fsyncs would get slowed down by the other process/client.

eric
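In case it helps, a minimal sketch of a test along those lines, run against the same ZFS-backed NFS share from two shells (or two clients); the mount point is illustrative:

    # shell A: sustained streaming write into the share
    dd if=/dev/zero of=/mnt/zfsshare/bigfile bs=1024k count=2048 &

    # shell B: time small, commit-heavy operations while A runs,
    # then repeat with A idle and compare
    time sh -c 'for i in 1 2 3 4 5 6 7 8 9 10; do
        mkdir /mnt/zfsshare/dir.$i
        cp /etc/hosts /mnt/zfsshare/dir.$i/f
    done'

If 6413510 is in play, the second timing should degrade sharply whenever the streaming write is active.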
I've submitted these to Roch and co before on the NFS list and off list. My favorite case was writing 6250 8k files (randomly generated) over NFS from a Solaris or Linux client. We originally were getting 20K/sec when I was using RAIDZ, but between switching to RAID-5-backed iSCSI LUNs in a zpool stripe and B40/41, we saw our performance approach a more reasonable 300-400K/sec average. I get closer to 1-3MB/sec with UFS as the backend vs ZFS. Of course, if it's locally attached storage (not iSCSI), performance starts to be parallel to that of UFS or better.

There is some built-in latency and some major penalties for streaming writes of various sizes with the NFS implementation and its fsync happiness (3 fsyncs per write from an NFS client). It's all very true that it's stable/safe, but it's also very slow in various use cases!

On 8/1/06, eric kustarz <eric.kustarz@sun.com> wrote:
> Joe Little wrote:
> [...]
>
> Do you have a reproducible test case for this? If so, I would be
> interested...
>
> I wonder if you're hitting:
>
>     6413510 zfs: writing to ZFS filesystem slows down fsync() on other
>     files in the same FS
>     http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510
>
> which Neil is finishing up as we type.
>
> The problem basically is that fsyncs can get slowed down by non-related
> I/O, so if you had a process/NFS client that was doing lots of I/O and
> another doing fsyncs, the fsyncs would get slowed down by the other
> process/client.
>
> eric
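For reference, a sketch of the small-file workload described above (6250 files of 8k each, roughly 50MB in total), run from an NFS client inside the mounted share; the path is illustrative and /dev/urandom stands in for whatever generator was originally used:

    cd /mnt/zfsshare
    time sh -c 'i=0; while [ $i -lt 6250 ]; do
        dd if=/dev/urandom of=file.$i bs=8k count=1 2>/dev/null
        i=`expr $i + 1`
    done'

The 20K/sec to 1-3MB/sec figures quoted above presumably refer to aggregate throughput for this kind of run, not per-file speed.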
I had the same problem. Read the following article - http://docs.info.apple.com/article.html?artnum=302780

Most likely you have "Allow Host Cache Flushing" checked. Uncheck it and try again.
Please also read http://docs.info.apple.com/article.html?artnum=303503.