I've checked in a fix to the 1.2 and unstable trees for a problem we discovered yesterday with /dev/random. Basically, the virtual drivers weren't adding entropy to the kernel entropy pool, which tended to mean that /dev/random blocked for long periods of time. The problem was particularly acute with NFS root systems, where, with no entropy input, /dev/random blocked forever.

If you were having problems with Apache being slow to start (listening on port 80, but not servicing requests), you should find the problem goes away with the latest tarballs.

Ian

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
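For anyone wanting to check whether a domain is affected, one rough approach is to watch the kernel's entropy estimate. The sketch below assumes the guest kernel exposes /proc/sys/kernel/random/entropy_avail (standard in mainline Linux, but not confirmed in this thread for the Xen kernels). The function names and the 64-bit threshold are illustrative choices, not part of the fix itself.

```shell
#!/bin/sh
# Print the kernel's current entropy estimate, in bits.
# The optional argument overrides the /proc path (handy for testing).
entropy_avail() {
    cat "${1:-/proc/sys/kernel/random/entropy_avail}"
}

# Return 0 if the estimate is below a threshold, 1 if it is not,
# and 2 if the estimate cannot be read.
# $1 = threshold in bits (default 64, an arbitrary illustrative value)
# $2 = optional path override, passed through to entropy_avail
entropy_low() {
    threshold="${1:-64}"
    bits=$(entropy_avail "$2") || return 2
    [ "$bits" -lt "$threshold" ]
}

# Typical use inside the domain:
#   entropy_avail
#   entropy_low && echo "entropy pool is nearly empty"
```

If the estimate sits near zero and never recovers while a process is blocked reading /dev/random, that would be consistent with the symptom described above, though it is not proof of the same bug.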
Hi All,

My goodness! See the message I just now posted to xen-devel about NFS root hangs; could this be what we're hitting? The most recent hang we saw happened while an rsync was running over ssh *and* someone restarted apache...

This wouldn't cause the "NFS server not responding/NFS server OK" messages on the domain's console, though (or does that show up as a symptom of this too?)

Steve

On Wed, May 05, 2004 at 09:36:08AM +0100, Ian Pratt wrote:
> I've checked in a fix to the 1.2 and unstable trees for a problem
> we discovered yesterday with /dev/random. [...]
> My goodness! See the message I just now posted to xen-devel about NFS
> root hangs; could this be what we're hitting? The most recent hang we
> saw happened while an rsync was running over ssh *and* someone restarted
> apache...
>
> This wouldn't cause the "NFS server not responding/NFS server OK"
> messages on the domain's console, though (or does that show up as a
> symptom of this too?)

I don't think this is the cause of the NFS hangs you've been seeing; that appears to be a generic Linux thing (at least we see it with our regular Linux boxes as well as with Xen boxes). However, if you want to test the theory, the easiest thing to do is to change the /dev/random device node to be an alias for /dev/urandom (a non-blocking but potentially weaker source of randomness).

The /dev/random bug only really manifested for us during boot, only on Xen, and resulted in a permanent hang.

The "NFS server foo not responding" followed by later "NFS server foo OK" messages from Linux appear to be due to a combination of stupid timeouts in the Linux sunrpc code and another bug which can cause automounters to fall into an uninterruptible sleep. If you check "ps auwwx" on a machine which is having problems and notice processes in state 'D', then this is biting you. Even if this doesn't occur, the crappy timeouts in the regular Linux code mean that Linux performs very badly if it gets any errors/loss/congestion during NFS operations.

cheers,

S.
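Both checks suggested above can be scripted. The following is a minimal sketch, not something posted by the original authors: the function names are made up, the filter assumes the usual `ps auwwx` column layout with STAT in the eighth field, and the alias is done with a symlink (a device node with major 1, minor 9 would be the other obvious way).

```shell
#!/bin/sh
# Filter `ps auwwx` output read from stdin: keep the header line plus any
# process whose STAT field (column 8) starts with D (uninterruptible sleep).
d_state() {
    awk 'NR == 1 || $8 ~ /^D/'
}

# Temporarily make <devdir>/random an alias for <devdir>/urandom.
# devdir defaults to /dev, in which case this must be run as root.
# The original node is kept as random.orig so it can be moved back later.
alias_random_to_urandom() {
    devdir="${1:-/dev}"
    mv "$devdir/random" "$devdir/random.orig" &&
        ln -s "$devdir/urandom" "$devdir/random"
}

# Typical use on the affected machine:
#   ps auwwx | d_state
#   alias_random_to_urandom
```

The alias is meant only as a diagnostic: if the hangs persist with /dev/random pointing at /dev/urandom, the entropy bug can be ruled out. Moving random.orig back afterwards restores the blocking behaviour.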
Are you also using Linux as an NFS server? We use Linux extensively in-house for client machines and have not seen this, although I'm sure we don't use the default Linux settings.

-Kip

On Thu, 13 May 2004, Steven Hand wrote:
> I don't think this is the cause of the NFS hangs you've been seeing; that
> appears to be a generic linux thing (at least we see it with our regular
> linux boxes as well as with xen boxes); [...]
The server is Debian woody, 2.4.21.

I've never seen any obvious way to actually set any timeout, block size, or other parameters for an NFS root partition -- and it seems to ignore whatever's in fstab, which makes sense.

Right now the only reason I'm even using NFS is because a Xenoserver provider needs to be able to do backups, migration, failover, and so on. How are other people meeting these requirements? Has the CoW development stalled? What about live migration?

Steve

On Thu, May 13, 2004 at 07:54:57AM -0700, Kip Macy wrote:
> Are you also using Linux as an NFS server? We use Linux extensively
> in-house for client machines and have not seen this. Although I'm sure
> we don't use the default Linux settings.

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
> I've never seen any obvious way to actually set any timeout, block
> size, or other parameters for an NFS root partition -- and it seems to
> ignore whatever's in fstab, which makes sense.

What happens if you do a "mount -o remount" on an NFS root? Is it just ignored?

It is possible to set some options on the kernel command line. See linux-2.4.26/Documentation/nfsroot.txt

> Right now the only reason I'm even using NFS is because a Xenoserver
> provider needs to be able to do backups, migration, failover, and so on.
> How are other people meeting these requirements? Has the CoW
> development stalled?

NFS root should be a good strategy -- it's unfortunate the Linux code has problems. If you can find a reliable way of triggering the Linux lockup, we'll have a sporting chance of being able to fix it, and hopefully get a patch into the mainline tree.

Bin Ren developed a CoW block device driver, but I don't think it's received a huge amount of testing. Bin: could you check this in?

We also have a user-space CoW NFS server that runs in domain0 and exports file systems to other domains. This is undergoing testing right now.

> What about live migration?

Live migration is now working nicely -- I've got "one last bug" that affects SMP systems, then I'll check it in.

Ian
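The nfsroot.txt file mentioned above documents a kernel parameter of the form nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]. As a rough illustration of how NFS options could be attached to the root mount that way, here is a small helper that composes the arguments. The helper name, the server address 10.0.0.1, the export path /export/domU1 and the option values are all made up for the example, and the ip= parameter needed for network setup is left out.

```shell
#!/bin/sh
# Compose kernel command-line arguments for an NFS root, following the
# nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] syntax from nfsroot.txt.
# $1 = server address, $2 = exported root directory,
# $3 = optional comma-separated NFS options
nfsroot_cmdline() {
    server="$1"
    rootdir="$2"
    opts="$3"
    printf 'root=/dev/nfs nfsroot=%s:%s' "$server" "$rootdir"
    if [ -n "$opts" ]; then
        printf ',%s' "$opts"
    fi
    printf '\n'
}

# Hypothetical example (address, path and values are placeholders):
#   nfsroot_cmdline 10.0.0.1 /export/domU1 rsize=8192,wsize=8192
```

Which options the 2.4 NFS root code actually honours is best checked against nfsroot.txt for the kernel in use; this sketch only shows where they go.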
> Right now the only reason I'm even using NFS is because a Xenoserver
> provider needs to be able to do backups, migration, failover, and so on.
> How are other people meeting these requirements? Has the CoW
> development stalled? What about live migration?

I think iSCSI is the way to go. However, I don't know of any good open source iSCSI targets.

-Kip
> I've never seen any obvious way to actually set any timeout, block
> size, or other parameters for an NFS root partition -- and it seems to
> ignore whatever's in fstab, which makes sense.

Have you considered doing a -o remount early in boot, passing a different set of options? These are the options we use on Linux:

defaults,intr,rsize=8192,wsize=8192,nfsvers=3,tcp,timeo=600

We obviously don't use Linux as an NFS server, so YMMV.

-Kip
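A remount along those lines might look like the sketch below, which could be called from an early boot script. The function name and the DRY_RUN switch are invented for illustration, and "defaults" is dropped from the list since it adds nothing to a remount. Whether the kernel actually applies changed rsize, wsize or transport settings on a remount of an NFS root is exactly the open question in this thread, so treat it as an experiment rather than a known fix. Note that timeo is given in tenths of a second, so timeo=600 means 60 seconds.

```shell
#!/bin/sh
# Remount the root filesystem with a given set of NFS options.
# $1 = comma-separated options (defaults to the list quoted above,
#      minus "defaults").
# If DRY_RUN is set to a non-empty value, the command is printed
# instead of being executed.
remount_nfs_root() {
    opts="${1:-intr,rsize=8192,wsize=8192,nfsvers=3,tcp,timeo=600}"
    ${DRY_RUN:+echo} mount -o "remount,$opts" /
}

# Preview the command without touching the mount:
#   DRY_RUN=1 remount_nfs_root
# Run it for real (as root, early in boot):
#   remount_nfs_root
```

Comparing the options shown in /proc/mounts before and after the call is one way to see whether anything changed.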