Hello All,

I have been running FBSD a long while, and the server I am having trouble with has been running it since the 5.x releases. I basically have a small network and just use NIS/NFS to link my various FBSD and Solaris machines together.

This had all been running fine up till a few days ago, when all of a sudden NFS slowed to a crawl, with CPU usage so high that the box almost appears to freeze. When I had 6.1-RC running all seemed well. Then came the announcement of the official 6.1 release, so I did the cvs updates, made world and kernel, and ran mergemaster to get everything up to the 6.1 stable version.

Now, after doing this, something is wrong with NFS. It works, it will return information and open files, but it is very, very slow, and while a request is being served the CPU spike is astounding. A simple du of my home directory can take minutes, and the machine all but locks up if the request is done over NFS. Here is a top snip:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME    WCPU COMMAND
  497 root        1   4    0  1252K   780K -      2  50:42 188.48% nfsd

This is a nice IBM eServer with dual P4 Xeons and a couple GB of RAM on a disk array, and locally it screams; heck, NFS used to scream till I updated. I am not really sure what info would be useful in debugging, so I won't post tons of misc junk in this email, but if anyone has any ideas as to how best to figure out and resolve this issue, it would sure be appreciated...

---
Howard Leadmon
http://www.leadmon.net

Howard Leadmon wrote:
> [...]
> Now after doing this, something is wrong with NFS. It works, it will return
> information and open files, just it's very very slow, and while performing a
> request the CPU spike is astounding.
> [...]

Are you running rpc.lockd? I've had very bad luck with it since sometime in the 5.x series, especially when it has to interoperate with Solaris. I submitted a PR on it, but it's apparently broken in about X ways. If possible, I would suggest living without rpc.lockd for now (if you're currently running it, that is).

Other than that issue, NFS itself has been working nicely for me.

> Are you running rpc.lockd? I've had very bad luck with it since
> sometime in the 5.x series... especially with it interoperating with
> Solaris. I submitted a PR on it, but it's apparently broken in about X
> ways. If possible, I would suggest living without rpc.lockd for now (if
> you're currently living with it that is)

On the contrary, my NFS problems interoperating with Linux have cleared up since upgrading Linux to Fedora Core 5 and FreeBSD to 6.1. In particular, rpc.lockd works, everything is OK, and performance is fine. I had very bad problems in the past, when we were running Fedora Core 3.

--
Michel TALON

On Sun, May 14, 2006 at 02:28:55PM -0400, Howard Leadmon wrote:
> [...]

Use tcpdump and related tools to find out what traffic is being sent.

Also verify that you did not change your system configuration in any way: there have been no changes to NFS since the release, so it is unclear why an update would cause the problem to suddenly occur.

Kris

"Rong-en Fan" <grafan@gmail.com> wrote:
> On 5/14/06, Kris Kennaway <kris@obsecurity.org> wrote:
>> On Sun, May 14, 2006 at 02:28:55PM -0400, Howard Leadmon wrote:
>>> [...]
>> Use tcpdump and related tools to find out what traffic is being sent.
>>
>> Also verify that you did not change your system configuration in any
>> way: there have been no changes to NFS since the release, so it is
>> unclear why an update would cause the problem to suddenly occur.
>>
>> Kris
>
> Hi Kris and Howard,
>
> As I posted a few days ago, I have problems similar to Howard's
> (some details in the thread "6.1-RELEASE, em0 high interrupt rate
> and nfsd eats lots of cpu" on stable@). After binary searching
> the source tree, I found that
>
> RELENG_6_1, 2006.04.30.03.57 ok
> RELENG_6_1, 2006.04.30.04.00 bad
>
> The only commit is kern/vfs_lookup.c, an MFC of rev 1.90 and 1.91.
> With 04.30 03.57's source + manually patched vfs_lookup.c rev 1.90,
> the same problem occurs.
[...]

Confirmed! I can reproduce the problem here at will.

Setup 1: NFS server 'testido' running FreeBSD 6.1-STABLE as of 15 May 2006 with sys/kern/vfs_lookup.c 1.80.2.7; NFS client 'schurks' running FreeBSD 6.1-STABLE as of 15 May 2006. /usr/src from testido is mounted on /mnt on schurks.

Running 'cd /mnt ; du >/dev/null' two times (first after a fresh boot of testido, second when all served data is in testido's memory):

joerg @ schurks> cd /mnt
joerg @ schurks> time du >/dev/null
   86.09s real     0.14s user     1.91s system
joerg @ schurks> time du >/dev/null
  205.10s real     0.20s user     1.92s system
joerg @ schurks>

Screenful of top output on testido AFTER both tests (testido stopped responding to screen output at times, especially during the second test):

last pid: 329; load averages: 4.14, 2.77, 1.25 up 0+00:07:30 18:44:47
29 processes: 1 running, 28 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 8420K Active, 28M Inact, 72M Wired, 110M Buf, 880M Free
Swap: 4000M Total, 4000M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
  201 root        1   4    0  1232K   792K -        4:42 116.31% nfsd
  329 joerg       1  96    0  2404K  1676K RUN      0:00   0.00% top
  168 root        1 115    0  2456K  1760K select   0:00   0.00% sshd
  313 root        1  96    0  1428K  1168K select   0:00   0.00% rlogind
  194 root        1 115    0  1556K  1256K select   0:00   0.00% mountd
  299 root        1   8    0  1720K  1436K wait     0:00   0.00% login
  314 root        1   8    0  1748K  1460K wait     0:00   0.00% login
  298 root        1  96    0  1304K  1048K select   0:00   0.00% rlogind
  199 root        1   4    0  1356K  1040K accept   0:00   0.00% nfsd
  256 root        1  96    0  2892K  1760K select   0:00   0.00% ntpd
  315 joerg       1  20    0  1448K  1020K pause    0:00   0.00% ksh
  300 root        1   5    0  1448K   996K ttyin    0:00   0.00% ksh
  158 root        1  96    0  1332K   940K select   0:00   0.00% syslogd
  163 root        1  96    0  1448K  1128K select   0:00   0.00% inetd
  176 root        1  96    0  1408K  1044K select   0:00   0.00% rpcbind
  185 root        1  96    0  1476K  1148K select   0:00   0.00% ypbind
  261 root        1 115    0  1304K   952K select   0:00   0.00% lpd

Setup 2: NFS server 'testido' running FreeBSD 6.1-STABLE as of 15 May 2006 with sys/kern/vfs_lookup.c 1.80.2.6; NFS client 'schurks' running FreeBSD 6.1-STABLE as of 15 May 2006.

Same tests as before:

joerg @ schurks> time du >/dev/null
   22.63s real     0.15s user     1.82s system
joerg @ schurks> time du >/dev/null
   16.52s real     0.17s user     1.68s system
joerg @ schurks>

Screenful of top output on testido AFTER both tests (testido responded fine during both tests):

last pid: 329; load averages: 0.49, 0.26, 0.10 up 0+00:01:50 18:35:30
29 processes: 1 running, 28 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 8424K Active, 28M Inact, 72M Wired, 110M Buf, 880M Free
Swap: 4000M Total, 4000M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
  201 root        1   4    0  1232K   792K -        0:03   3.76% nfsd
  168 root        1 115    0  2456K  1760K select   0:00   0.00% sshd
  329 joerg       1  96    0  2404K  1676K RUN      0:00   0.00% top
  313 root        1  96    0  1428K  1168K select   0:00   0.00% rlogind
  194 root        1 115    0  1556K  1256K select   0:00   0.00% mountd
  299 root        1   8    0  1720K  1440K wait     0:00   0.00% login
  314 root        1   8    0  1748K  1464K wait     0:00   0.00% login
  298 root        1  96    0  1304K  1048K select   0:00   0.00% rlogind
  199 root        1   4    0  1356K  1040K accept   0:00   0.00% nfsd
  315 joerg       1  20    0  1448K  1020K pause    0:00   0.00% ksh
  256 root        1  96    0  2892K  1760K select   0:00   0.00% ntpd
  300 root        1   5    0  1448K   996K ttyin    0:00   0.00% ksh
  158 root        1  96    0  1332K   940K select   0:00   0.00% syslogd
  163 root        1  96    0  1448K  1128K select   0:00   0.00% inetd
  261 root        1 109    0  1304K   952K select   0:00   0.00% lpd
  176 root        1  96    0  1408K  1044K select   0:00   0.00% rpcbind
  185 root        1  96    0  1476K  1148K select   0:00   0.00% ypbind

See the HUGE difference in the TIME consumed by nfsd (4:42 vs. 0:03). The only difference between the two setups was sys/kern/vfs_lookup.c version 1.80.2.6 vs. 1.80.2.7.

Joerg

--
Mail: Joerg.Lehners@Informatik.Uni-Oldenburg.DE
Tel: 2198
Real: Joerg Lehners, Informatik ARBI, Uni Oldenburg, D-26111 Oldenburg
Ugly words: cost cutting - profit maximization - cheap, cheap, cheap
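
To put numbers on the difference between the two setups, here is a small sketch that only redoes the arithmetic on the figures posted above. The helper names (parse_clock, slowdown) are made up for illustration, and it assumes the TIME column of top shows accumulated CPU time as minutes:seconds.

```python
def parse_clock(text):
    """Convert a top(1) TIME value such as '4:42' to seconds.

    Assumption: the value is minutes:seconds of accumulated CPU time.
    """
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def slowdown(bad, good):
    """How many times larger the 'bad' measurement is than the 'good' one."""
    return bad / good


# "real" seconds reported by time(1) on schurks, copied from the thread:
# (with vfs_lookup.c 1.80.2.7, with vfs_lookup.c 1.80.2.6)
du_runs = {
    "first du (after fresh boot)": (86.09, 22.63),
    "second du (data cached)": (205.10, 16.52),
}

# Accumulated nfsd CPU time from the TIME column of top on testido
nfsd_bad = parse_clock("4:42")
nfsd_good = parse_clock("0:03")

if __name__ == "__main__":
    for label, (bad, good) in du_runs.items():
        print(f"{label}: {slowdown(bad, good):.1f}x slower with 1.80.2.7")
    print(f"nfsd CPU time: {nfsd_bad}s vs {nfsd_good}s "
          f"({slowdown(nfsd_bad, nfsd_good):.0f}x)")
```

On these figures the du runs come out roughly 3.8x and 12.4x slower with 1.80.2.7. Notably, the cached run is the one that degrades the most, which suggests the extra time is spent on CPU in the server rather than waiting for the disk.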