So I'm running NFS to get content to my web servers, and I use DRBD on the NFS server for redundancy. I've now had this problem 2 times (about 2 weeks since the last occurrence).

All my web sites stopped responding, so I started by checking dmesg, and there I found a bunch of these errors:

Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out

But when I checked the NFS server, lockd was running, and I could access all the files from the web server with ls, cd, etc.

The logs on the NFS server don't say anything of interest, and Apache's error_log just says "not found or unable to stat".

As I mentioned, this has happened 2 times, and both times I've "solved" it by rebooting the NFS server and the web servers. Having to reboot my servers every couple of weeks isn't a good solution, so I really could use some help. :)

I also get this from time to time on the web servers; I don't know if it's related:

do_vfs_lock: VFS is out of sync with lock manager!
It happened again last night, but this time I temporarily(?) fixed it by mounting with -o nolock on the web servers.

It works, but dmesg is still being spammed with "lockd: server 192.168.20.22 not responding, timed out". At least my sites are up, and the message isn't critical anymore.

But how can I get rid of it?
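For readers who want to try the same workaround, here is a minimal sketch of what mounting with nolock might look like. Only the server address and the nolock option come from the message above; the export path /export/www and the mount point /var/www are hypothetical placeholders. The script defaults to a dry run and only prints the command.

```sh
#!/bin/sh
# Sketch only: SERVER is taken from the thread; EXPORT and MOUNTPOINT
# are hypothetical and must be replaced with your own values.
SERVER=192.168.20.22
EXPORT=/export/www
MOUNTPOINT=/var/www

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Mount the export without NLM locking (the workaround from the message).
run mount -t nfs -o nolock "$SERVER:$EXPORT" "$MOUNTPOINT"
```

Set DRY_RUN=0 and run as root to perform the mount for real. Keep in mind that with nolock, file locks taken by applications only protect against other processes on the same client, so this is a workaround rather than a fix if several web servers lock the same files.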
On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
> So I'm running nfs to get content to my web servers. Now I've had this
> problem 2 times (about 2 weeks since the last occurrence).
> I use drbd on the nfs server for redundancy. Now to my problem:
>
> All my web sites stopped responding so I started by checking dmesg and
> there I found a bunch of this errors
> Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
> Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
>
> But when checking the nfs server lockd was running and I could access
> all the files from the webserver with ls, cd etc.
>
> The logs on the nfs server doesn't say anything of interest and
> checking apaches error_log just says "not found or unable to stat".
>
> Now I mentioned this have happened 2 times and both these times I've
> "solved" it by rebooting the nfs server and web servers. This isn't a
> good solution to have to reboot my servers every couple of weeks so I
> really could use some help. :)
>
> Also I get this from time to time on the web servers, dunno if it's
> related.
> do_vfs_lock: VFS is out of sync with lock manager!

----
I too have been having the same issue with my NFS server, which seems to have started when I updated on July 27th (5.2).

It seems to happen after logrotate on Sunday morning, but I don't find out about it until users show up on Monday morning.

/var/log/messages has...

Aug 4 09:32:59 cube kernel: lockd: server HOSTNAME not responding, still trying

and like you, I've rebooted the main server each time (Monday mornings)... there's something wrong that I can't figure out.

Craig
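Since both reports only noticed the problem after users complained, it may help to scan the logs for the lockd message directly. The sketch below counts "lockd: server ... not responding" lines per NFS server, covering both the "timed out" and the "still trying" variants quoted in this thread. The function name count_lockd_timeouts is made up for this example.

```sh
#!/bin/sh
# Count "lockd: server <name> not responding" messages per NFS server.
# Reads syslog-style lines on stdin and prints "<count> <server>" lines.
# count_lockd_timeouts is a hypothetical helper name, not an existing tool.
count_lockd_timeouts() {
    # Keep only matching lines, reduced to the server name or address.
    sed -n 's/.*lockd: server \([^ ]*\) not responding.*/\1/p' |
        sort | uniq -c |
        # Normalise the padded output of uniq -c to "<count> <server>".
        awk '{print $1, $2}'
}
```

Typical use would be `count_lockd_timeouts < /var/log/messages` or `dmesg | count_lockd_timeouts`, for example from a cron job on each web server, so that a non-empty result can raise an alert before Monday morning.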
Johan Swensson wrote:
> It happend again this night but now I temporarily(?) fixed it with
> mounting -o nolock on the web servers.
> It works but dmesg is still spamming "lockd: server 192.168.20.22 not
> responding, timed out". Atleast my sites are up, and the message isn't
> critical anymore.
> But how can I get rid of it?

What does 'rpcinfo -p' read on both the servers and the clients?

Also, how about

/etc/init.d/nfs status (both client and server)

and

/etc/init.d/nfslock status (both client and server)

Any firewalls in between client and server? Run:

iptables -L -n (on both client and server)

nate
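The checks requested above have to be run on every client and on the server, so it can be convenient to collect them in one labelled report per host. The following is a rough sketch, not a standard tool: run_check and collect_nfs_diag are names invented for this example, and the commands inside are exactly the ones listed in the message. Run it as root so that iptables can list its rules.

```sh
#!/bin/sh
# Run one command, label its output, and never abort the whole report.
# run_check and collect_nfs_diag are hypothetical helper names.
run_check() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        # Show stderr too, and note a non-zero exit instead of stopping.
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "(not available on this host: $1)"
    fi
}

# Gather the checks asked for in the thread, tagged with the host name.
collect_nfs_diag() {
    echo "### host: $(uname -n)"
    run_check rpcinfo -p
    run_check /etc/init.d/nfs status
    run_check /etc/init.d/nfslock status
    run_check iptables -L -n
}
```

Running `collect_nfs_diag > nfs-diag-$(uname -n).txt` on each machine gives one file per host that can be compared with diff or pasted into a reply.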
On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
> So I'm running nfs to get content to my web servers. Now I've had this
> problem 2 times (about 2 weeks since the last occurrence).
> I use drbd on the nfs server for redundancy. Now to my problem:
>
> All my web sites stopped responding so I started by checking dmesg and
> there I found a bunch of this errors
> Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
> Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
>
> But when checking the nfs server lockd was running and I could access
> all the files from the webserver with ls, cd etc.

This is the exact problem we were having here. Rebooting is the only solution.

And as already mentioned further down the thread, it was attributed to this:

https://bugzilla.redhat.com/show_bug.cgi?id=453094

My solution was to extract the patch from the upstream kernel in

http://people.redhat.com/dzickus/el5/103.el5/src/

called linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch and reroll the latest centosplus kernel srpm with it. Servers have been fine for 6 days running this kernel.

As much as I hate carrying custom kernel rpms, this is a showstopper for us, and it looks like it won't make it in until 5.3.

Personally, given the limited scope of the patch and the apparent unwillingness of Red Hat to include it in an update, I'd advocate CentOS carrying it as a custom patch.

Here's my srpm if anyone wants it:

http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm

The only change is the patch for this issue. Everything builds cleanly via mock.

--
Matthew Kent \ SA \ bravenet.com