Hello,

I have been having trouble with my NFS server for a long time. The
symptom is that it stops working one or two weeks after boot. I have
not been able to track down the cause yet, but it is reproducible and
has only occurred under very high I/O load.

It did not panic; it just stopped working. It still responded to ping,
but userland programs did not seem to be running. I was able to break
into DDB and get a kernel dump. The following URLs hold a log of ps,
trace, etc.:

 http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
 http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102

Does anyone see how to debug this? I guess this is due to a deadlock
somewhere. I have suffered from this problem for almost two years.
The above log is from stable/9 as of Dec 19, but the problem has
persisted since 8.X.

-- Hiroki
Hiroki Sato <hrs at freebsd.org> wrote:
> I have been having trouble with my NFS server for a long time.
> The symptom is that it stops working one or two weeks after
> boot ... It did not panic; it just stopped working. It still
> responded to ping, but userland programs did not seem to be running ...

> Does anyone see how to debug this? I guess this is due to a
> deadlock somewhere ...

If you can afford the overhead, you could try running with some of the
kernel debug options enabled (e.g. WITNESS, INVARIANTS, MUTEX_DEBUG).
See conf/NOTES for descriptions.
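As a rough sketch of what that suggestion could look like in practice, the
fragment below adds the two options named above that are known to exist,
plus the companion options they are normally paired with, to a custom
kernel config. The config name DEBUGNFS and the use of "include GENERIC"
are assumptions made for illustration only. MUTEX_DEBUG is left out because
it may not be available on stable/9; check conf/NOTES before adding it.

```
# Hypothetical config file, e.g. sys/<arch>/conf/DEBUGNFS
# (the name DEBUGNFS is an assumption, not from this thread)
include GENERIC
ident   DEBUGNFS

options INVARIANTS          # extra run-time consistency checks
options INVARIANT_SUPPORT   # support code required by INVARIANTS
options WITNESS             # lock order verification
options WITNESS_SKIPSPIN    # skip spin mutexes to reduce WITNESS overhead
```

Expect a noticeable slowdown with WITNESS enabled, which matters on a
server that only hangs under very high I/O load.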
Hiroki Sato wrote:
> Hello,
>
> I have been having trouble with my NFS server for a long time. The
> symptom is that it stops working one or two weeks after boot. I have
> not been able to track down the cause yet, but it is reproducible and
> has only occurred under very high I/O load.
>
> It did not panic; it just stopped working. It still responded to ping,
> but userland programs did not seem to be running. I was able to break
> into DDB and get a kernel dump. The following URLs hold a log of ps,
> trace, etc.:
>
> http://people.allbsd.org/~hrs/FreeBSD/pool.log.20130102
> http://people.allbsd.org/~hrs/FreeBSD/pool.dmesg.20130102
>
> Does anyone see how to debug this? I guess this is due to a deadlock
> somewhere. I have suffered from this problem for almost two years.
> The above log is from stable/9 as of Dec 19, but the problem has
> persisted since 8.X.
>
Well, I took a quick glance at the log and there are a lot of processes
sleeping on "pfault" (in vm_waitpfault() in sys/vm/vm_page.c). I'm no
vm guy, so I'm not sure when/why that will happen. The comment on the
function suggests they are waiting for free pages.

Maybe it is something as simple as running out of swap space, or a
problem talking to the disk(s) that hold the swap partition(s), or ???
(I'm talking through my hat here, because I'm not conversant with the
vm side of things.)

I might take a closer look this evening and see if I can spot anything
in the log, rick
ps: I hope Alan and Kostik don't mind being added to the cc list.

> -- Hiroki
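One way to put a number on "a lot of processes sleeping on pfault" is to
tally the wait-message column of the captured ps output. The sketch below
is only an illustration: it assumes whitespace-separated columns with the
wait message in the sixth field, which will not hold for every line of
real DDB output (running threads have an empty wait message, and threaded
processes are printed differently). The function name and the SAMPLE text
are made up for the example and are not taken from the posted log.

```python
from collections import Counter

# Made-up sample in an assumed "pid ppid pgrp uid state wmesg wchan cmd"
# layout. This is NOT data from the log posted in this thread.
SAMPLE = """\
  pid  ppid  pgrp   uid  state  wmesg    wchan       cmd
  101     1   101     0  D      pfault   0xdead0001  nfsd
  102     1   102     0  D      pfault   0xdead0001  sshd
  103     1   103     0  S      select   0xdead0002  syslogd
"""


def tally_wait_messages(ps_text, wmesg_column=5):
    """Count lines per wait message in ps-style text.

    Assumption: columns are whitespace-separated, the first field is a
    numeric pid, and the wait message sits at index `wmesg_column`.
    Lines that do not fit (headers, short lines) are skipped.
    """
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) <= wmesg_column or not fields[0].isdigit():
            continue
        counts[fields[wmesg_column]] += 1
    return counts


if __name__ == "__main__":
    # Print wait messages from most to least common.
    for wmesg, n in tally_wait_messages(SAMPLE).most_common():
        print(wmesg, n)
```

If most entries end up under a single wait message, that at least shows
where to start reading the stack traces.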
Andriy Gapon wrote:
> on 29/01/2013 23:44 Hiroki Sato said the following:
> > http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
> > http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
>
> I recognize here a ZFS ARC deadlock that should have been prevented by
> r241773 and its MFCs (r242858 for 9, r242859 for 8).
>
Unfortunately, pool-20130130-info.txt shows a kernel built from r244417,
which is newer than the r242858 MFC, unless I somehow misread it.

rick

> See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and
> tid 100639 (nfsd in kmem_back).
>
> --
> Andriy Gapon