Hello all,

We're seeing an interesting behavior in our homologation (test) infrastructure. The test bed is:

- 3 hosts (M610 blades) with XENSOURCE 4.1 over CentOS (it's reproducible with 4.02, 4.1.1, etc.): XEN1 to XEN3
- 2 servers acting as NFS servers hosting Windows and Linux DomUs (Server1 and Server2)

Each Xen host has DomUs running from both Server1 and Server2 (for example, 4 DomUs running from Server1 and 4 DomUs running from Server2). The DomUs are both Windows PV-on-HVM and Linux PV-on-HVM (no PV images).

Everything works very well until we shut down one of the servers (Server1, for example). As expected, the associated DomUs stop completely until we turn Server1 on again.

Now, here's the funny thing: some of the Xen hosts stop working completely! A simple "w" command issued in bash hangs until we press Control-C. The only solution is to restart the whole server.

We have not been able to find a pattern yet. We expected at least Server2's DomUs to keep running even if Server1's DomUs go offline, but no luck.

Has anybody here seen this? Many thanks!

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
On Mon, Aug 22, 2011 at 8:47 AM, Antonio Pina (antonio.pina) <antonio.pina@alog.com.br> wrote:
> - 2 servers acting as NFS servers hosting Windows and Linux DomUs
>   (Server1 and Server2)
>
> Everything works very well until we shut down one of the servers (Server1,
> for example). As expected, the associated DomUs stop completely until we
> turn Server1 on again.
>
> Now, here's the funny thing: some of the Xen hosts stop working completely!
> A simple "w" command issued in bash hangs until we press Control-C. The
> only solution is to restart the whole server.
>
> We have not been able to find a pattern yet. We expected at least
> Server2's DomUs to keep running even if Server1's DomUs go offline, but no
> luck.

That's not how NFS works (not by default, anyway).

> Has anybody here seen this? Many thanks!

It's a general issue with NFS (not Xen-specific). The default behaviour of NFS in case of error is "report 'server not responding' on the console and continue retrying indefinitely" and "not allow file operations to be interrupted" (see http://linux.die.net/man/5/nfs). That behaviour can take up enough CPU cycles that you'll be unable to do anything else (in your example, the "w" command).

You MIGHT be able to work around this by explicitly using the "soft" and "intr" mount options. In that case you will have to manually remount the NFS share and restart any programs currently using it.

--
Fajar
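Before changing any mount options, it can help to confirm how the shares are mounted right now. The following is a minimal sketch, not something from the thread: it assumes the Linux /proc/mounts format and assumes that an NFS entry without an explicit "soft" option is a hard mount. The function name nfs_mount_modes is made up for illustration.

```python
def nfs_mount_modes(mounts_text):
    """Return {mount_point: "hard" or "soft"} for NFS entries in /proc/mounts text."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        # Assumption: no explicit "soft" option means the mount is hard.
        modes[fields[1]] = "soft" if "soft" in options else "hard"
    return modes


# Typical use on a dom0 (reads the live mount table):
#     with open("/proc/mounts") as f:
#         print(nfs_mount_modes(f.read()))
```

A mount reported as "hard" matches the default behaviour described above: I/O against an unreachable server is retried indefinitely.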
Please keep cc to the list.

On Mon, Aug 22, 2011 at 9:16 AM, <brian@krusic.com> wrote:
> Fajar,
>
> Would a workaround be autofs instead of a static mount?
>
> Perhaps use a really short timeout value, as it won't expire due to
> constant access unless it's down.

Not sure. Try it and see. My GUESS is that automount will not be able to offer any improvement, as it would see that the NFS share is still mounted (even with I/O errors).

--
Fajar
Antonio Pina (antonio.pina)
2011-Aug-22 02:27 UTC
RE: [Xen-users] NFS goes down, XEN hangs
Thank you for your answer.

I understand and agree with your point, but that's not the case here. As soon as Server1 comes back, the NFS directory becomes available again (ls -l /server1 works). The CPU usage is low as usual and I can use some commands, like "xl list", but not other commands, like "w", as I said before.

Thank you.
On Mon, Aug 22, 2011 at 9:27 AM, Antonio Pina (antonio.pina) <antonio.pina@alog.com.br> wrote:
> Thank you for your answer.
>
> I understand and agree with your point but it's not the case. As soon as
> Server1 comes back, the NFS directory becomes available again (ls -l
> /server1 works). The CPU is low as usual and I can use some commands like
> "xl list",

The default NFS behaviour does that.

> but not other commands, like "w", as I said before.

You might be able to get better help from other people with more NFS expertise. My best guess is that "w" is somehow trying to interact with the process accessing the stale NFS mount (at least "strace w" shows it's accessing /proc/[pid]/stat), and since that process is uninterruptible, "w" has to wait (thus the appearance of a "hang").

--
Fajar
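To see which processes are in the uninterruptible state that this guess refers to, one option is to read the state field from /proc/[pid]/stat directly. This is a rough sketch under the assumption that the third field of that file is the process state and that "D" means uninterruptible sleep, as documented in proc(5). The function names are hypothetical, and on a badly wedged host even these reads might block.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/[pid]/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible_processes() on the affected dom0 while Server1 is down should list whichever processes are stuck waiting on I/O, which narrows down what "w" might be waiting for.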
I think your mail was intended for Antonio. Forwarded to the list.

--
Fajar

On Mon, Aug 22, 2011 at 10:00 AM, Andrew Wells <agwells0714@gmail.com> wrote:
> If I got this correct: you are using Xen to host NFS servers that you are
> using to netboot other Xen machines? And if an NFS server goes down, the
> associated netbooted machines become unresponsive?
>
> I have seen this behavior all the time in physical (not virtual)
> environments, so I doubt it's Xen. I would use highly available NFS
> instead. And maybe not use a virtual machine to host the NFS server.
Antonio Pina (antonio.pina)
2011-Aug-22 12:59 UTC
RE: [Xen-users] NFS goes down, XEN hangs
Actually, "strace w" did the trick: it showed me that "w" hangs when trying to read from "/usr/sbin/tapdisk2" referencing "Server1". It seems that NFS comes back online by itself (I can "ls" inside dom0, as I said), but tapdisk2 does not. tapdisk2 is probably the one to blame. This probably belongs on xen-devel...
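For anyone repeating this diagnosis, it may be safer to run commands such as "w" or "ls" through a wrapper that gives up after a timeout, so the shell itself never gets stuck. The sketch below is only an illustration; responds_within is a made-up name. It polls the child instead of using subprocess.run(timeout=...), which, as far as I know, waits for the child after killing it and could therefore block on a child in uninterruptible sleep.

```python
import subprocess
import time


def responds_within(command, timeout_seconds):
    """Run command in a child process; return True if it exits in time."""
    child = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if child.poll() is not None:
            return True
        time.sleep(0.05)
    # Ask the kernel to kill the child but do NOT wait for it: a child stuck
    # in uninterruptible sleep would make wait() block as well.
    child.kill()
    return False
```

With this, responds_within(["ls", "-l", "/server1"], 5) and responds_within(["w"], 5) can be compared on the same host: in the situation described above, the first would be expected to return True once Server1 is back, and the second False while tapdisk2 is still stuck.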