Hi,

I am setting up an environment with FreeBSD 11.1 sharing a ZFS datastore to
VMware ESXi 6.7. There were a number of errors with NFS 4.1 sharing that I
didn't understand until I found the following thread:

  <https://lists.freebsd.org/pipermail/freebsd-stable/2018-March/088486.html>

I traced the commits that Rick has made since that thread and merged them from
'head' into 'stable':

  svnlite checkout http://svn.freebsd.org/base/release/11.1.0/
  svnlite merge -c 332790 http://svn.freebsd.org/base/head
  svnlite merge -c 333508 http://svn.freebsd.org/base/head
  svnlite merge -c 333579 http://svn.freebsd.org/base/head
  svnlite merge -c 333580 http://svn.freebsd.org/base/head
  svnlite merge -c 333592 http://svn.freebsd.org/base/head
  svnlite merge -c 333645 http://svn.freebsd.org/base/head
  svnlite merge -c 333766 http://svn.freebsd.org/base/head
  svnlite merge -c 334396 http://svn.freebsd.org/base/head
  svnlite merge -c 334492 http://svn.freebsd.org/base/head
  svnlite merge -c 327674 http://svn.freebsd.org/base/head

That completely fixed the connection instability, but the NFS share was still
mounting read-only with a RECLAIM_COMPLETE error. So I manually applied the
first patch from the previous thread and everything started working:

  --- fs/nfsserver/nfs_nfsdserv.c.savrecl	2018-02-10 20:34:31.166445000 -0500
  +++ fs/nfsserver/nfs_nfsdserv.c	2018-02-10 20:36:07.947490000 -0500
  @@ -4226,10 +4226,9 @@ nfsrvd_reclaimcomplete(struct nfsrv_desc
   		goto nfsmout;
   	}
   	NFSM_DISSECT(tl, uint32_t *, NFSX_UNSIGNED);
  +	nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
   	if (*tl == newnfs_true)
  -		nd->nd_repstat = NFSERR_NOTSUPP;
  -	else
  -		nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
  +		nd->nd_repstat = 0;

The question is: Did I miss something? Is there an alternate change already in
SVN that does the same thing better, or is there some corner case preventing
this patch from being finalized that I just haven't run into yet?

Thanks,
Daniel Engel
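In case it helps anyone following along, here is roughly how the patched tree
gets turned into a running kernel. This is just the standard FreeBSD kernel
rebuild procedure; the checkout location, the patch file name, and the GENERIC
KERNCONF below are assumptions for illustration, not details taken from the
messages in this thread:

  # Assumes the svnlite checkout above was made into /usr/src and the
  # RECLAIM_COMPLETE diff was saved as /tmp/reclaimcomplete.patch.
  cd /usr/src/sys
  patch -p0 < /tmp/reclaimcomplete.patch    # applies to fs/nfsserver/nfs_nfsdserv.c
  cd /usr/src
  make -j4 buildkernel KERNCONF=GENERIC     # substitute your own KERNCONF if not GENERIC
  make installkernel KERNCONF=GENERIC
  shutdown -r now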
Daniel Engel wrote:
> I am setting up an environment with FreeBSD 11.1 sharing a ZFS datastore to
> VMware ESXi 6.7. There were a number of errors with NFS 4.1 sharing that I
> didn't understand until I found the following thread:
>
>   <https://lists.freebsd.org/pipermail/freebsd-stable/2018-March/088486.html>
>
> I traced the commits that Rick has made since that thread and merged them
> from 'head' into 'stable':
[svnlite commands snipped]
> That completely fixed the connection instability, but the NFS share was still
> mounting read-only with a RECLAIM_COMPLETE error. So I manually applied the
> first patch from the previous thread and everything started working:
[patch snipped]
> The question is: Did I miss something? Is there an alternate change already
> in SVN that does the same thing better, or is there some corner case
> preventing this patch from being finalized that I just haven't run into yet?

Andreas Nagy has been doing quite a bit of testing for me w.r.t. the ESXi 6.5
client, but several serious issues (which appear to me to be violations of the
RFC) have not yet been resolved. This email summarizes them:

  http://docs.FreeBSD.org/cgi/mid.cgi?YTOPR0101MB0953E687D013E2E97873061ADD720

He recently reported that 6.7 worked better, but he has not yet sent me any
packet traces, so I don't know which issues still exist for 6.7.

I have committed a few things that didn't break the RFC, such as adding
BindConnectionToSession, but I haven't committed anything else yet, due to
concerns w.r.t. violating the RFC. (The above email thread discusses that.)
I do plan on doing something once I get packet traces from Andreas, but be
forewarned that VMware states "FreeBSD is not a supported server", and that is
certainly true.

Andreas uses connection trunking. You might be OK with a single TCP connection
unless the server reboots. (He runs a bunch of patches I gave him, some of
which definitely violate the RFC.)

All I can suggest is that you keep an eye on freebsd-current@ for any email
about commits to handle the ESXi client better. So, this is very much a work
in progress, rick
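On the single-TCP-connection point above: for reference, an NFS 4.1 datastore
backed by one server address (i.e. no session trunking) is added on ESXi
roughly as sketched below. The address, share path, and datastore name are
placeholders, and the exact esxcli options should be verified against the
ESXi 6.7 documentation:

  # One server address => one TCP connection, no trunking.
  esxcli storage nfs41 add -H 192.168.1.10 -s /tank/vmstore -v fbsd-nfs41
  # Trunking (as Andreas uses) would list several server addresses instead:
  #   esxcli storage nfs41 add -H 10.0.0.1,10.0.0.2 -s /tank/vmstore -v fbsd-nfs41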
Daniel Engel wrote:
[stuff snipped]
> I traced the commits that Rick has made since that thread and merged them
> from 'head' into 'stable':
>
>   svnlite checkout http://svn.freebsd.org/base/release/11.1.0/
>   svnlite merge -c 332790 http://svn.freebsd.org/base/head
>   svnlite merge -c 333508 http://svn.freebsd.org/base/head
>   svnlite merge -c 333579 http://svn.freebsd.org/base/head
>   svnlite merge -c 333580 http://svn.freebsd.org/base/head
>   svnlite merge -c 333592 http://svn.freebsd.org/base/head
>   svnlite merge -c 333645 http://svn.freebsd.org/base/head
>   svnlite merge -c 333766 http://svn.freebsd.org/base/head
>   svnlite merge -c 334396 http://svn.freebsd.org/base/head
>   svnlite merge -c 334492 http://svn.freebsd.org/base/head
>   svnlite merge -c 327674 http://svn.freebsd.org/base/head

Yes, you have all the commits to head related to the 4.1 server that might
affect the ESXi client, plus a bunch that should be harmless but that I don't
think affect the ESXi client mounts. (Most of these will get MFC'd to
stable/11, but I haven't gotten around to it yet.)

The issues that might be in 6.7 (they were in 6.5) and that may bite you are:

- The client does an OpenDowngrade with all OPEN_SHARE_ACCESS and
  OPEN_SHARE_DENY bits set for something it calls a "drive lock". (Adding bits
  is supposed to be done via an Open/ClaimNull and not an OpenDowngrade.) I'd
  really like to know whether this still happens with 6.7.

- Something about "directory modified too often" when deleting a bunch of
  files. (I have no idea what this one means, but apparently it was seen with
  other NFSv4.1 servers.)

- Some warnings about "wrong reason for not issuing a delegation". I have a
  fix for this one in PR#226650, but they are just warnings and don't seem to
  matter much.

The rest of the really nasty stuff happens after a server reboot. The recovery
code seemed to be badly broken in the 6.5 client. (All sorts of fun stuff like
the client looping doing ExchangeID operations forever, VM crashes, ...)

> That completely fixed the connection instability, but the NFS share was still
> mounting read-only with a RECLAIM_COMPLETE error. So I manually applied the
> first patch from the previous thread and everything started working:
>
>   --- fs/nfsserver/nfs_nfsdserv.c.savrecl	2018-02-10 20:34:31.166445000 -0500
>   +++ fs/nfsserver/nfs_nfsdserv.c	2018-02-10 20:36:07.947490000 -0500
>   @@ -4226,10 +4226,9 @@ nfsrvd_reclaimcomplete(struct nfsrv_desc
>    		goto nfsmout;
>    	}
>    	NFSM_DISSECT(tl, uint32_t *, NFSX_UNSIGNED);
>   +	nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
>    	if (*tl == newnfs_true)
>   -		nd->nd_repstat = NFSERR_NOTSUPP;
>   -	else
>   -		nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
>   +		nd->nd_repstat = 0;

I think this patch is OK to use, since no other extant client does a
ReclaimComplete with "one_fs == true". It does kinda violate the RFC. The
problem is that FreeBSD exports a hierarchy of file systems, and telling the
server that one of them has been reclaimed is useless. (This hack just assumes
the client meant to say "one_fs == false".)

There was also a case (I think it was after a server reboot) where the client
would do one of these after doing a ReclaimComplete with "one_fs == false",
and that is definitely bogus (the server would reply NFS4ERR_COMPLETE_ALREADY
without the above hack), since the "one_fs == false" operation means all file
systems have been reclaimed.

Anyhow, once I get some packet traces from Andreas for 6.7, I'll try and
figure out how to handle at least some of the outstanding issues.

Good luck with it, rick
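To make the effect of that patch easier to follow, this is roughly how the
relevant lines of nfsrvd_reclaimcomplete() read once it is applied. It is a
commented sketch reconstructed from the diff quoted above, not the verbatim
file contents:

  	NFSM_DISSECT(tl, uint32_t *, NFSX_UNSIGNED);	/* decode the client's rca_one_fs flag */
  	/* Treat every ReclaimComplete as a server-wide ("one_fs == false") one. */
  	nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
  	/*
  	 * If the client actually said "one_fs == true" (as ESXi does), discard
  	 * any error from that check (e.g. "reclaim already complete") rather
  	 * than returning NFSERR_NOTSUPP as the unpatched code did.
  	 */
  	if (*tl == newnfs_true)
  		nd->nd_repstat = 0;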