Garrett Wollman wrote:
> Like many other users, I upgrade my FreeBSD servers by NFS-mounting
> /usr/src and /usr/obj from a shared build server.[1] Since I upgraded
> the build server to 9.3, clients running 9.3 kernels have been
> randomly erroring out during installkernel and installworld. Today I
> had some time to look more closely into this and found that the error
> is definitely coming from the server: at some point, it just randomly
> starts returning errors to client ACCESS and GETATTR operations. The
> errors are a mix of NFS3ERR_IO and NFS3ERR_ACCES, but there is nothing
> on the server to indicate any kind of error, and restarting the
> operation on the client causes it to fail in a different place. With
> enough patience and restarts, it's possible to complete the
> installation in just four or five passes.
>
> Needless to say this is a bit worrying. Strangely, 9.1 and 9.2
> clients don't see this issue at all; it's only 9.3 clients that
> break.
>
> It's easy to reproduce: just 'cd /usr/src && find . -type f >/dev/null'.
> It does not seem to depend on the client NFS version (3 or 4) or
> implementation ("old" or "new"). I haven't tried the "old" server yet
> -- I'll need to figure out how to do that first.
>
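In case it helps line the failures up with a packet trace, the find
reproduction above could be wrapped in something like the following. This
is only a rough sketch: walk_once is a name made up here, and it assumes
the failures show up as error messages (or a non-zero exit) from find.

```shell
#!/bin/sh
# walk_once is a hypothetical helper, not part of any FreeBSD tool.
# It walks a tree once, the same way the 'find . -type f' reproduction
# does, and reports whether find hit any error.
walk_once() {
    # File names (stdout) are discarded; error messages (stderr) are captured.
    errs=$(find "$1" -type f 2>&1 >/dev/null)
    status=$?
    if [ "$status" -ne 0 ] || [ -n "$errs" ]; then
        # A UTC timestamp makes it easier to find the matching packets.
        printf 'FAIL at %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf '%s\n' "$errs" | head -n 5
        return 1
    fi
    printf 'OK\n'
}
```

Running "walk_once /usr/src" in a loop on a 9.3 client until it prints
FAIL should give a timestamp and the first few paths that errored.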
Well, I took a quick look and, if I read it correctly, there is a single
line change in the "old" client between 9.2 and 9.3, which defines
an otherwise unused mount flag called NFSMNT_NONCONTIGWR. (It is
only used by the new client when "nocontigwr" is specified.)
However, there were some fairly extensive changes made (mostly by mav@)
to the kernel rpc (sys/rpc), which is used by both clients and both
servers.
Most of these changes were committed to stable/9 as r261057, r261058.
If you could build a kernel from stable/9 just prior to r261057 and see
if that client runs into the problem, it could help determine if these
changes are causing the problem.
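
Roughly, building that test kernel might look like the sketch below. It
makes several assumptions: that the source tree is an svn checkout of
stable/9, that GENERIC is the kernel config in use, and that r261056 is an
acceptable "just prior" revision. The names run, build_pre_rpc_kernel and
kernel.test are made up for illustration. Setting DRY_RUN=1 only prints
the commands.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN set, print the command instead of
# running it, so the sequence can be reviewed first.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Build a kernel from the tree as of the revision just before r261057 and
# install it next to the normal one (assumed name: /boot/kernel.test).
build_pre_rpc_kernel() {
    src=${1:-/usr/src}   # assumption: an svn checkout of stable/9
    rev=261056           # assumption: last revision before r261057
    run svn update -r "$rev" "$src" &&
    run make -C "$src" buildkernel KERNCONF=GENERIC &&
    run make -C "$src" installkernel KERNCONF=GENERIC KODIR=/boot/kernel.test &&
    # nextboot selects the test kernel for the next boot only.
    run nextboot -k kernel.test
}
```

Since the test kernel goes into its own directory and is only selected
for one boot, the regular 9.3 kernel stays in place as a fallback.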
Alternatively, running the 9.3 system with a 9.2 sys/rpc (if it links/runs)
could also help show whether the kernel rpc is the culprit. (You can
load the kernel rpc as a module, but it's linked into most kernels.)
If it doesn't turn out to be in the kernel rpc, my next guess would
be changes to the net device driver. (To check for this, you could maybe
use a different type of hardware device, or the 9.2 driver on the 9.3
system.)
The "new" client has some changes 9.2->9.3, but since nothing changed
for the "old" client and you see the problem with the "old" one, I
think the NFS client is not the culprit.
rick
> If anyone is willing to help debug this, I can share a packet trace,
> but I don't think it's very informative. Also, if anyone has a good
> dtrace script that I could run on the server that would report what's
> going on when that first NFS3ERR_IO is returned, that would be great.
>
> -GAWollman
>
> [1] I'd run my own freebsd-update server but unfortunately it is too
> tied to building things that look like official FreeBSD security
> updates, and isn't really designed for (e.g.) updating kernels when we
> change a configuration option. It also doesn't have any obvious knobs
> for building with anything other than a default {make,src}.conf.
> And with a pkg-able base just around the corner I don't really want to
> put much effort into making freebsd-update do what I want. NFS, on
> the other hand, is a big deal and so I need to track down and fix
> these bugs.
>