thr3ads.net - CentOS - [CentOS] NFS issues [Aug 2008]

Johan Swensson

2008-Aug-12 12:27 UTC

[CentOS] NFS issues

So I'm running NFS to get content to my web servers. I've now had this
problem 2 times (about 2 weeks since the last occurrence).
I use DRBD on the NFS server for redundancy. Now to my problem:

All my web sites stopped responding, so I started by checking dmesg, and
there I found a bunch of these errors:

Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out


But when I checked the NFS server, lockd was running and I could access
all the files from the web server with ls, cd etc.

The logs on the NFS server don't say anything of interest, and Apache's
error_log just says "not found or unable to stat".

As I mentioned, this has happened 2 times, and both times I "solved" it
by rebooting the NFS server and the web servers. Having to reboot my
servers every couple of weeks isn't a good solution, so I really could
use some help. :)

Also I get this from time to time on the web servers, dunno if it's related:
do_vfs_lock: VFS is out of sync with lock manager!

Johan Swensson

2008-Aug-13 02:38 UTC

It happened again last night, but this time I temporarily(?) fixed it by
mounting with -o nolock on the web servers.
It works, but dmesg is still being spammed with "lockd: server
192.168.20.22 not responding, timed out". At least my sites are up, and
the message isn't critical anymore.
But how can I get rid of it?
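
To double-check which NFS mounts on a web server really picked up nolock, something like the following could be run against the contents of /proc/mounts. This is only a rough sketch: the function name nfs_mounts_without_nolock is made up for illustration, and it assumes the usual /proc/mounts layout (device, mount point, filesystem type, options, ...) and that the kernel lists nolock among the options there.

```python
def nfs_mounts_without_nolock(mounts_text):
    """Return (device, mountpoint) pairs for NFS mounts lacking 'nolock'.

    Assumes each line looks like:
    device mountpoint fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype not in ("nfs", "nfs4"):
            continue  # only NFS client mounts are of interest here
        if "nolock" not in options.split(","):
            missing.append((device, mountpoint))
    return missing
```

Passing it the text read from /proc/mounts on each web server should return an empty list once every NFS mount has been remounted with -o nolock.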


Craig White

2008-Aug-13 02:52 UTC

On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
> So I'm running nfs to get content to my web servers. Now I've had this
> problem 2 times (about 2 weeks since the last occurrence).
> I use drbd on the nfs server for redundancy. Now to my problem:
>
> All my web sites stopped responding so I started by checking dmesg and
> there I found a bunch of this errors
> Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
> Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
>
> But when checking the nfs server lockd was running and I could access
> all the files from the webserver with ls, cd etc.
>
> The logs on the nfs server doesn't say anything of interest and
> checking apaches error_log just says "not found or unable to stat".
>
> Now I mentioned this have happened 2 times and both these times I've
> "solved" it by rebooting the nfs server and web servers. This isn't a
> good solution to have to reboot my servers every couple of weeks so I
> really could use some help. :)
>
> Also I get this from time to time on the web servers, dunno if it's
> related.
> do_vfs_lock: VFS is out of sync with lock manager!
----
I too have been having the same issues with my NFS server, which seem
to have started when I updated on July 27th (5.2).

It seems to happen after logrotate on Sunday morning but I didn't know
about it until users show up on Monday mornings.

/var/log/messages has...

Aug  4 09:32:59 cube kernel: lockd: server HOSTNAME not responding, still trying

and like you, I've rebooted the main server each time (Monday
mornings)...there's something wrong that I can't figure out
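
To see whether the lockd messages really cluster around particular days (for instance right after the Sunday logrotate), the syslog lines could be tallied per day. The following is a minimal sketch, not a tested tool: the name lockd_timeouts_per_day and the regular expression are assumptions based only on the log lines quoted in this thread.

```python
import re
from collections import Counter

# Matches syslog lines such as:
# Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
LOCKD_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:\s+"
    r"lockd: server (?P<server>\S+) not responding"
)


def lockd_timeouts_per_day(lines):
    """Count 'lockd: server ... not responding' messages per (date, server)."""
    counts = Counter()
    for line in lines:
        match = LOCKD_RE.match(line)
        if match:
            # Normalise the day so "Aug  4" and "Aug 4" land in the same bucket.
            date = f"{match.group('month')} {int(match.group('day'))}"
            counts[(date, match.group("server"))] += 1
    return counts
```

Feeding it the lines of /var/log/messages (and any rotated copies) gives a count per date and server, which can then be compared against when logrotate ran.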

Craig

nate

2008-Aug-13 03:16 UTC

Johan Swensson wrote:
> It happend again this night but now I temporarily(?) fixed it with
> mounting -o nolock on the web servers.
> It works but dmesg is still spamming "lockd: server 192.168.20.22 not
> responding, timed out". Atleast my sites are up, and the message isn't
> critical anymore.
> But how can I get rid of it?

What does 'rpcinfo -p' read on both the servers and the clients?

Also how about /etc/init.d/nfs status (both client and server)
and /etc/init.d/nfslock status (both client and server)

Any firewalls in between client and server?
Run: iptables -L -n (on both client and server)
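
When comparing the 'rpcinfo -p' output across several machines, a small helper can flag which RPC services are not registered. This is just a sketch under assumptions: the function name missing_rpc_services is invented here, the default list is what one would typically expect on an NFS server, and it relies on the usual five-column layout (program, vers, proto, port, service).

```python
# Services one would usually expect to see registered on an NFS server.
# On a client, a shorter list such as ("portmapper", "nlockmgr", "status")
# is probably more appropriate.
REQUIRED = ("portmapper", "nfs", "mountd", "nlockmgr", "status")


def missing_rpc_services(rpcinfo_output, required=REQUIRED):
    """Return the names from `required` that do not appear in `rpcinfo -p` output."""
    registered = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows look like: program vers proto port service
        # (the header row is skipped because its first field is not numeric).
        if len(fields) >= 5 and fields[0].isdigit():
            registered.add(fields[4])
    return [name for name in required if name not in registered]
```

If nlockmgr shows up in the result for the server, lockd is not registered with the portmapper there, which would be consistent with the "lockd: server ... not responding" messages on the clients.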

nate

Matthew Kent

2008-Aug-13 16:27 UTC

On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
> So I'm running nfs to get content to my web servers. Now I've had this
> problem 2 times (about 2 weeks since the last occurrence).
> I use drbd on the nfs server for redundancy. Now to my problem:
>
> All my web sites stopped responding so I started by checking dmesg and
> there I found a bunch of this errors
> Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
> Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
>
> But when checking the nfs server lockd was running and I could access
> all the files from the webserver with ls, cd etc.

This is the exact problem we were having here. Rebooting is the only
solution.

And as already mentioned further down the thread it was attributed to
this https://bugzilla.redhat.com/show_bug.cgi?id=453094

My solution was to extract the patch from the upstream kernel in 
http://people.redhat.com/dzickus/el5/103.el5/src/
called
linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch

and reroll the latest centosplus kernel srpm with it. Servers have been
fine for 6 days running this kernel.

As much as I hate carrying custom kernel rpms, this is a showstopper for
us, and it looks like it won't make it in until 5.3.

Personally, given the limited scope of the patch and the apparent
unwillingness of Red Hat to include it in an update, I'd advocate CentOS
carrying it as a custom patch.

Here's my srpm if anyone wants it, 
http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
the only change is the patch for this issue. Everything builds cleanly
via mock. 
-- 
Matthew Kent \ SA \ bravenet.com
