Hi,

I am setting up an environment with FreeBSD 11.1 sharing a ZFS datastore to
VMware ESXi 6.7. There were a number of errors with NFS 4.1 sharing that I
didn't understand until I found the following thread:

  <https://lists.freebsd.org/pipermail/freebsd-stable/2018-March/088486.html>

I traced the commits that Rick has made since that thread and merged them from
'head' into 'stable':

  svnlite checkout http://svn.freebsd.org/base/release/11.1.0/
  svnlite merge -c 332790 http://svn.freebsd.org/base/head
  svnlite merge -c 333508 http://svn.freebsd.org/base/head
  svnlite merge -c 333579 http://svn.freebsd.org/base/head
  svnlite merge -c 333580 http://svn.freebsd.org/base/head
  svnlite merge -c 333592 http://svn.freebsd.org/base/head
  svnlite merge -c 333645 http://svn.freebsd.org/base/head
  svnlite merge -c 333766 http://svn.freebsd.org/base/head
  svnlite merge -c 334396 http://svn.freebsd.org/base/head
  svnlite merge -c 334492 http://svn.freebsd.org/base/head
  svnlite merge -c 327674 http://svn.freebsd.org/base/head

That completely fixed the connection instability, but the NFS share was still
mounting read-only with a RECLAIM_COMPLETE error. So I manually applied the
first patch from the previous thread and everything started working:

  --- fs/nfsserver/nfs_nfsdserv.c.savrecl	2018-02-10 20:34:31.166445000 -0500
  +++ fs/nfsserver/nfs_nfsdserv.c	2018-02-10 20:36:07.947490000 -0500
  @@ -4226,10 +4226,9 @@ nfsrvd_reclaimcomplete(struct nfsrv_desc
   		goto nfsmout;
   	}
   	NFSM_DISSECT(tl, uint32_t *, NFSX_UNSIGNED);
  +	nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
   	if (*tl == newnfs_true)
  -		nd->nd_repstat = NFSERR_NOTSUPP;
  -	else
  -		nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
  +		nd->nd_repstat = 0;

The question is: Did I miss something? Is there an alternate change already in
SVN that does the same thing better, or is there some corner case preventing
this patch from being finalized that I just haven't run into yet?

Thanks,
Daniel Engel
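In case it helps anyone following along, here is roughly how the patched tree
gets turned into a running kernel. This is just the standard FreeBSD kernel
rebuild procedure; the checkout location, the patch file name, and the GENERIC
KERNCONF below are assumptions for illustration, not details taken from the
messages in this thread:

  # Assumes the svnlite checkout above was made into /usr/src and the
  # RECLAIM_COMPLETE diff was saved as /tmp/reclaimcomplete.patch.
  cd /usr/src/sys
  patch -p0 < /tmp/reclaimcomplete.patch    # applies to fs/nfsserver/nfs_nfsdserv.c
  cd /usr/src
  make -j4 buildkernel KERNCONF=GENERIC     # substitute your own KERNCONF if not GENERIC
  make installkernel KERNCONF=GENERIC
  shutdown -r now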
Daniel Engel wrote:
> I am setting up an environment with FreeBSD 11.1 sharing a ZFS datastore to
> VMware ESXi 6.7. There were a number of errors with NFS 4.1 sharing that I
> didn't understand until I found the following thread:
>
>   <https://lists.freebsd.org/pipermail/freebsd-stable/2018-March/088486.html>
>
> I traced the commits that Rick has made since that thread and merged them
> from 'head' into 'stable':
[svnlite commands snipped]
> That completely fixed the connection instability, but the NFS share was still
> mounting read-only with a RECLAIM_COMPLETE error. So I manually applied the
> first patch from the previous thread and everything started working:
[patch snipped]
> The question is: Did I miss something? Is there an alternate change already
> in SVN that does the same thing better, or is there some corner case
> preventing this patch from being finalized that I just haven't run into yet?

Andreas Nagy has been doing quite a bit of testing for me w.r.t. the ESXi 6.5
client, but several serious issues (which appear to me to be violations of the
RFC) have not yet been resolved. This email summarizes them:

  http://docs.FreeBSD.org/cgi/mid.cgi?YTOPR0101MB0953E687D013E2E97873061ADD720

He recently reported that 6.7 worked better, but he has not yet sent me any
packet traces, so I don't know which issues still exist for 6.7.

I have committed a few things that didn't break the RFC, such as adding
BindConnectionToSession, but I haven't committed anything else yet, due to
concerns w.r.t. violating the RFC. (The above email thread discusses that.)
I do plan on doing something once I get packet traces from Andreas, but be
forewarned that VMware states "FreeBSD is not a supported server", and that is
certainly true.

Andreas uses connection trunking. You might be OK with a single TCP connection
unless the server reboots. (He runs a bunch of patches I gave him, some of
which definitely violate the RFC.)

All I can suggest is that you keep an eye on freebsd-current@ for any email
about commits to handle the ESXi client better. So, this is very much a work
in progress, rick
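On the single-TCP-connection point above: for reference, an NFS 4.1 datastore
backed by one server address (i.e. no session trunking) is added on ESXi
roughly as sketched below. The address, share path, and datastore name are
placeholders, and the exact esxcli options should be verified against the
ESXi 6.7 documentation:

  # One server address => one TCP connection, no trunking.
  esxcli storage nfs41 add -H 192.168.1.10 -s /tank/vmstore -v fbsd-nfs41
  # Trunking (as Andreas uses) would list several server addresses instead:
  #   esxcli storage nfs41 add -H 10.0.0.1,10.0.0.2 -s /tank/vmstore -v fbsd-nfs41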
Daniel Engel wrote:
[stuff snipped]
> I traced the commits that Rick has made since that thread and merged them
> from 'head' into 'stable':
>
>   svnlite checkout http://svn.freebsd.org/base/release/11.1.0/
>   svnlite merge -c 332790 http://svn.freebsd.org/base/head
>   svnlite merge -c 333508 http://svn.freebsd.org/base/head
>   svnlite merge -c 333579 http://svn.freebsd.org/base/head
>   svnlite merge -c 333580 http://svn.freebsd.org/base/head
>   svnlite merge -c 333592 http://svn.freebsd.org/base/head
>   svnlite merge -c 333645 http://svn.freebsd.org/base/head
>   svnlite merge -c 333766 http://svn.freebsd.org/base/head
>   svnlite merge -c 334396 http://svn.freebsd.org/base/head
>   svnlite merge -c 334492 http://svn.freebsd.org/base/head
>   svnlite merge -c 327674 http://svn.freebsd.org/base/head

Yes, you have all the commits to head related to the 4.1 server that might
affect the ESXi client, plus a bunch that should be harmless but that I don't
think affect the ESXi client mounts. (Most of these will get MFC'd to
stable/11, but I haven't gotten around to it yet.)

The issues that might be in 6.7 (they were in 6.5) and that may bite you are:

- The client does an OpenDowngrade with all OPEN_SHARE_ACCESS and
  OPEN_SHARE_DENY bits set for something it calls a "drive lock". (Adding bits
  is supposed to be done via an Open/ClaimNull and not an OpenDowngrade.) I'd
  really like to know whether this still happens with 6.7.

- Something about "directory modified too often" when deleting a bunch of
  files. (I have no idea what this one means, but apparently it was seen with
  other NFSv4.1 servers.)

- Some warnings about "wrong reason for not issuing a delegation". I have a
  fix for this one in PR#226650, but they are just warnings and don't seem to
  matter much.

The rest of the really nasty stuff happens after a server reboot. The recovery
code seemed to be badly broken in the 6.5 client. (All sorts of fun stuff like
the client looping doing ExchangeID operations forever, VM crashes, ...)

> That completely fixed the connection instability, but the NFS share was still
> mounting read-only with a RECLAIM_COMPLETE error. So I manually applied the
> first patch from the previous thread and everything started working:
>
>   --- fs/nfsserver/nfs_nfsdserv.c.savrecl	2018-02-10 20:34:31.166445000 -0500
>   +++ fs/nfsserver/nfs_nfsdserv.c	2018-02-10 20:36:07.947490000 -0500
>   @@ -4226,10 +4226,9 @@ nfsrvd_reclaimcomplete(struct nfsrv_desc
>    		goto nfsmout;
>    	}
>    	NFSM_DISSECT(tl, uint32_t *, NFSX_UNSIGNED);
>   +	nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
>    	if (*tl == newnfs_true)
>   -		nd->nd_repstat = NFSERR_NOTSUPP;
>   -	else
>   -		nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
>   +		nd->nd_repstat = 0;

I think this patch is OK to use, since no other extant client does a
ReclaimComplete with "one_fs == true". It does kinda violate the RFC. The
problem is that FreeBSD exports a hierarchy of file systems, and telling the
server that one of them has been reclaimed is useless. (This hack just assumes
the client meant to say "one_fs == false".)

There was also a case (I think it was after a server reboot) where the client
would do one of these after doing a ReclaimComplete with "one_fs == false",
and that is definitely bogus (the server would reply NFS4ERR_COMPLETE_ALREADY
without the above hack), since the "one_fs == false" operation means all file
systems have been reclaimed.

Anyhow, once I get some packet traces from Andreas for 6.7, I'll try and
figure out how to handle at least some of the outstanding issues.

Good luck with it, rick
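To make the effect of that patch easier to follow, this is roughly how the
relevant lines of nfsrvd_reclaimcomplete() read once it is applied. It is a
commented sketch reconstructed from the diff quoted above, not the verbatim
file contents:

  	NFSM_DISSECT(tl, uint32_t *, NFSX_UNSIGNED);	/* decode the client's rca_one_fs flag */
  	/* Treat every ReclaimComplete as a server-wide ("one_fs == false") one. */
  	nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
  	/*
  	 * If the client actually said "one_fs == true" (as ESXi does), discard
  	 * any error from that check (e.g. "reclaim already complete") rather
  	 * than returning NFSERR_NOTSUPP as the unpatched code did.
  	 */
  	if (*tl == newnfs_true)
  		nd->nd_repstat = 0;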