Osma Suominen
2005-Jun-02 10:22 UTC
[Xen-devel] wget and Zope crashes on post-2.0.6 -testing
Hello,

I reported time-related problems some days ago, with no replies:
http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628

I have problems with e.g. wget and Zope crashing on domU on a recent -testing build. This is on a Debian Sarge system, with kernel 2.6.11.11 and a Xen -testing snapshot from two days ago (2005-05-31). The problems are not as easy to trigger as with earlier versions (e.g. the 2.0.5 demo CD), but they do happen.

The symptom is that during heavy load, wget crashes with the message "acalc_rate: Assertion `msecs >= 0' failed", which probably means that time has stepped backwards (judging by earlier xen-devel posts).

Also, Zope frequently dies with different time-related error messages. Here's the end of a typical traceback:

  File "/usr/lib/zope2.7/lib/python/DateTime/DateTime.py", line 694, in _parse_args
    lt = safelocaltime(t)
  File "/usr/lib/zope2.7/lib/python/DateTime/DateTime.py", line 437, in safelocaltime
    raise TimeError, 'The time %f is beyond the range ' \
TimeError: The time nan is beyond the range of this Python implementation.

It is fairly easy to crash Zope this way by using a tool such as Apache's benchmarking utility ab/ab2 or wget to pound on it. It usually takes a few minutes on an otherwise unloaded machine to bring down Zope. Note that Zope runs just fine on a similar native Linux system, and after running production Zope systems for more than a year, I have never seen the kind of errors Zope on Xen brings up.

To cause the wget error (which I think is a symptom of a very similar problem), it is easiest to run SETI@Home, which will put enough load on the system. It might take a few attempts, but I can always crash wget this way when SETI is running.

It is my impression that these problems occur during bursts of high timer interrupt activity, but I haven't made detailed studies.

Is there anything I can do to help sort this out? For example, would it be a good idea to test unstable to see if it exhibits this behavior?
Any help is appreciated, and since I soon need to run a production Zope system on several Xen hosts, I would like to find a solution to the frequent crashes.

-Osma

--
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
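[Editorial sketch] The hypothesis in this message is that wall-clock time steps backwards under load. One hedged way to test that directly from inside a domU is to sample the clock in a tight loop and count the samples that are earlier than their predecessor. The function name and the sample count below are illustrative assumptions, not something taken from the thread; the clock is passed in as a parameter so the check can be exercised with a fake clock.

```python
import time


def count_backward_steps(samples=100000, clock=time.time):
    """Sample `clock` repeatedly and count samples earlier than the previous one."""
    backward = 0
    prev = clock()
    for _ in range(samples):
        now = clock()
        if now < prev:
            # The clock reported an earlier time than the previous sample.
            backward += 1
        prev = now
    return backward


if __name__ == "__main__":
    print("backward steps observed:", count_backward_steps())
```

A count of zero while wget still fails its `msecs >= 0` assertion would suggest that the clock itself is not the culprit and that the elapsed-time arithmetic is going wrong somewhere else.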
Ian Pratt
2005-Jun-02 13:50 UTC
RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
> I reported time-related problems some days ago, with no replies:
> http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
>
> I have problems with e.g. wget and Zope crashing on domU on a
> recent -testing build. This is on a Debian Sarge system, with
> kernel 2.6.11.11 and a Xen -testing snapshot from two days
> ago (2005-05-31). The problems are not as easy to trigger as
> with earlier versions (e.g. the 2.0.5 demo CD), but they do happen.
>
> The symptom is that during heavy load, wget crashes with the message
> "acalc_rate: Assertion `msecs >= 0' failed", which probably
> means that time has stepped backwards (looking at earlier
> xen-devel posts).

I can't reproduce this wget crash, even running seti@home in the background as you suggest. I'm running this in dom0 on an SMP Xeon box.

Are you running NTP on your system? If so, what does "echo peers | ntpq" show? What happens if you disable it?

Is there anything odd about the system? Is the CPU clock speed correctly identified?

You could try the unstable tree -- it would certainly be interesting to know if there was a difference. The issue must be really quite specific to your machine or setup (e.g. the crystal is completely knackered) as otherwise lots of people would be complaining.

Best,
Ian
Osma Suominen
2005-Jun-02 14:21 UTC
RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
On Thu, 2 Jun 2005, Ian Pratt wrote:

> I can't reproduce this wget crash, even running seti@home in the
> background as you suggest.
> I'm running this in dom0 on an SMP Xeon box.
>
> Are you running NTP on your system? If so, what does "echo peers | ntpq"
> show? What happens if you disable it?
>
> Is there anything odd about the system? Is the CPU clock speed correctly
> identified?

Unfortunately the machine is not completely in my control. It is owned by another company and I only have access to dom1 and dom2, not dom0 (no other domains on the machine). But I will try to reproduce this on another machine. It might be a lot easier if there was a 2.0.6 demo CD, though...

It was easy to cause this crash with the 2.0.5 demo CD. I succeeded on all 3 machines (most of them old) I tried it on. The recipe was in a previous post to the list. But maybe Xen has changed so much that it's not relevant anymore.

Anyway, the machine I'm now observing is a Pentium IV server with HyperThreading. ntpd is running in dom0. As far as I can tell the clock speed (3.0GHz) is correctly reported. There are two identical machines, and the problem occurs on both (and in both dom1 and dom2), so broken hardware is likely not to blame.

AFAICT there is nothing odd with these machines; in fact the company owning them seems to make a good business out of renting out Xen domains to customers like me. However, since others aren't complaining loudly, the problem could be something related to the specific workload I'm putting on the machines.

> You could try the unstable tree -- it would certainly be interesting to
> know if there was a difference. The issue must be really quite specific
> to your machine or setup (e.g. the crystal is completely knackered) as
> otherwise lots of people would be complaining.

As I said, the specific machine is not entirely in my control, but I have the feeling I might be able to reproduce this on a spare machine, since it was so easy with 2.0.5.
In that case I will try unstable as well.

Anyway, thanks for your input. I will look at whether NTP is involved and do some further investigation.

-Osma

--
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***
Ian Pratt
2005-Jun-02 14:32 UTC
RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
> Unfortunately the machine is not completely in my control. It
> is owned by another company and I only have access to dom1
> and dom2, not dom0 (no other domains on the machine). But I
> will try to reproduce this on another machine. It might be a
> lot easier if there was a 2.0.6 demo CD, though...

Funny you should say that....

Please don't all download it at once, but there's a preview available at:
http://www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads/xendemo-2.0.6.iso

> Anyway, thanks for your input. I will look at whether NTP is
> involved and do some further investigation.

If you're running NTP in your local domain you should enable independent_wallclock,

e.g. echo 1 > /proc/sys/xen/independent_wallclock

or put independent_wallclock=1 on your kernel command line. [NB: someone should document the kernel config option]

I'll wager that this is your problem. Hmm, that's a pretty nasty failure mode. Without doing something gross and intercepting the adjtimex syscall there's not a lot we can do about it.

Ian
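[Editorial sketch] Before and after changing the setting suggested above, it can help to confirm what the domain currently reports. The snippet below reads the /proc file named in the message; the function name is an illustrative assumption, and the file is expected to exist only on Xen guest kernels that expose this sysctl, so a missing file is reported as None rather than treated as an error.

```python
import os

# Path quoted in the message above; present only on kernels that expose it.
WALLCLOCK_PATH = "/proc/sys/xen/independent_wallclock"


def independent_wallclock_enabled(path=WALLCLOCK_PATH):
    """Return True or False for the flag, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip() == "1"
```

Calling `independent_wallclock_enabled()` with no argument checks the real path; passing another path is only useful for trying the function out on a machine without Xen.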
Osma Suominen
2005-Jun-02 15:07 UTC
RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
On Thu, 2 Jun 2005, Ian Pratt wrote:

> Funny you should say that....
>
> Please don't all download it at once, but there's a preview available at:
> http://www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads/xendemo-2.0.6.iso

Wow! Thanks... I'll look into that.

> If you're running NTP in your local domain you should enable
> independent_wallclock
>
> e.g. echo 1 > /proc/sys/xen/independent_wallclock or put
> independent_wallclock=1 on your kernel command line. [NB: someone should
> document the kernel config option]
>
> I'll wager that this is your problem. Hmm, that's a pretty nasty failure
> mode. Without doing something gross and intercepting the adjtimex
> syscall there's not a lot we can do about it.

I'm not running NTP in domU, but it should be running in dom0, although it seems it's not working since the clock is out of sync.

I turned on independent_wallclock and was just about to report that it fixed the problem, and then it happened again. Twice. That is, SETI broke wget, even with independent_wallclock=1.

Also, with "apt-get install ntpdate ntp-simple" and SETI running I get this interesting Perl error (which I've seen before, during high load):

--clip--
debconf: Perl may be unconfigured (Global symbol "%priorities" requires explicit package name at /usr/share/perl5/Debconf/Priority.pm line 16.
Compilation failed in require at /usr/share/perl5/Debconf/Config.pm line 7.
BEGIN failed--compilation aborted at /usr/share/perl5/Debconf/Config.pm line 7.
Compilation failed in require at /usr/share/perl5/Debconf/Log.pm line 8.
Compilation failed in require at (eval 1) line 4.
BEGIN failed--compilation aborted at (eval 1) line 4.
) -- aborting
--clip--

And wget occasionally dies with "malloc: not enough memory", when the machine has 1 gig of free RAM (total 1.5G) plus 3G of swap. This is getting really weird...

I installed ntp on the domU in question and the problem remains, with independent_wallclock=1, ntp running and the clock in sync with the world.
-Osma

--
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***
Osma Suominen
2005-Jun-03 09:04 UTC
RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
On Thu, 2 Jun 2005, Ian Pratt wrote:

>> and dom2, not dom0 (no other domains on the machine). But I
>> will try to reproduce this on another machine. It might be a
>> lot easier if there was a 2.0.6 demo CD, though...
>
> Funny you should say that....
>
> Please don't all download it at once, but there's a preview available at:

I tried the demo CD and was able to reproduce this wget crash with it on an old Pentium III desktop PC. The recipe is basically the same as before, but I'll repeat it here:

1. boot the 2.0.6 demo CD in text mode
2. ifup eth0
3. wget ftp://alien.ssl.berkeley.edu/pub/setiathome-3.08.i686-pc-linux-gnu.tar
4. untar, run, and background setiathome (with ^Z and bg)
5. run wget a few times (took some half a dozen attempts for me)

When you've had wget crash, you can try some of the other tests in
http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628

Since this happens on a random PC with the demo CD, I'll bet that this is not some obscure problem with the specific hardware or software installation but a real bug in Xen.

-Osma

--
*** Osma Suominen / MB Concert Ky *** osma.suominen@mbconcert.fi ***
Keir Fraser
2005-Jun-08 17:44 UTC
Re: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
On 3 Jun 2005, at 10:04, Osma Suominen wrote:

> When you've had wget crash, you can try some of the other tests in
> http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
>
> Since this happens on a random PC with the demo CD, I'll bet that this
> is not some obscure problem with the specific hardware or software
> installation but a real bug in Xen.

This bug should now be fixed in our xen-2.0.testing.bk repository.

 -- Keir
Ian Pratt
2005-Jun-08 17:58 UTC
RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
> On 3 Jun 2005, at 10:04, Osma Suominen wrote:
>
> > When you've had wget crash, you can try some of the other tests in
> > http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
> >
> > Since this happens on a random PC with the demo CD, I'll bet that this
> > is not some obscure problem with the specific hardware or software
> > installation but a real bug in Xen.
>
> This bug should now be fixed in our xen-2.0.testing.bk repository.

This deserves a bit more explanation, as it probably affects all vendor kernels based on Xen 2.0 (SuSE 9.3 Pro, Debian, demo CD, Gentoo, etc.). It does *not* affect the kernel we ship in our 2.0 source and binary tarballs, which is why it's taken so long to pin down. It does *not* affect the unstable branch.

The reason the bug is not present in our kernels is the kernel config: we enable CONFIG_MD_RAID5=y in our config, which hides the bug, whereas most distros have this as a module.

The root cause of the bug is that during the boot sequence Linux tests whether the processor has the fdiv bug. This involves doing some floating point operations. Unfortunately, they are not wrapped in the kernel_fpu_begin()/end() calls that normally surround use of fp in the kernel. Native Linux gets away with this because it happens so early in the boot process that no-one else can be using the fpu. However, on Xen this gets us into a bad state, which will come back to haunt us much later on, resulting in fpu state corruption in user processes. The fix in 2.0-testing is simply to 'wrap' the fdiv test.

The reason the bug is not present on unstable is that the fpu code had already been rejigged so that we were immune to this kind of problem, as it had been identified as a potential fragility.

Since this bug hadn't been widely reported we probably won't rush to release a 2.0.6a demo CD, but vendor kernel maintainers should definitely pick up the fix.
Best,
Ian
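[Editorial sketch] The symptom described in this message is floating point state corruption in user processes. A user without access to dom0 could look for it by running a deterministic floating point computation over and over and comparing each result against the first one; on a healthy system the results are bit-for-bit identical. Whether this particular workload would actually trip on an affected kernel is an assumption, and the function names and iteration counts are illustrative.

```python
import math


def fp_workload(n=1000):
    """Deterministic floating point computation; same input gives same result."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += math.sqrt(i) / (i + 0.5)
    return acc


def find_fp_mismatches(iterations=200, n=1000):
    """Repeat the workload and return (iteration, value) pairs that differ
    from the first result or are NaN."""
    reference = fp_workload(n)
    mismatches = []
    for it in range(iterations):
        value = fp_workload(n)
        if math.isnan(value) or value != reference:
            mismatches.append((it, value))
    return mismatches


if __name__ == "__main__":
    print("mismatches:", find_fp_mismatches())
```

To mirror the conditions reported in the thread, the check would be run alongside another floating-point-heavy process, so that context switches between the two happen while both are using the FPU.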
Kurt Garloff
2005-Jun-08 20:59 UTC
Re: RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
Hi Ian,

On Wed, Jun 08, 2005 at 06:58:51PM +0100, Ian Pratt wrote:
> > On 3 Jun 2005, at 10:04, Osma Suominen wrote:
> >
> > > When you've had wget crash, you can try some of the other tests in
> > > http://thread.gmane.org/gmane.comp.emulators.xen.devel/10628
> > >
> > > Since this happens on a random PC with the demo CD, I'll bet that this
> > > is not some obscure problem with the specific hardware or software
> > > installation but a real bug in Xen.
> >
> > This bug should now be fixed in our xen-2.0.testing.bk repository.
>
> This deserves a bit more explanation, as it probably affects all vendor
> kernels based on Xen 2.0 (SuSE 9.3 Pro, Debian, demo CD, Gentoo, etc.)
> It does *not* affect the kernel we ship in our 2.0 source and binary
> tarballs, which is why it's taken so long to pin down. It does *not* affect
> the unstable branch.
>
> The reason the bug is not present in our kernels is due to the kernel
> config: we enable CONFIG_MD_RAID5=y in our config which hides the bug,
> whereas most distros have this as a module.
>
> The root cause of the bug is that during the boot sequence Linux tests
> to see whether the processor has the fdiv bug. This involves doing some
> floating point operations. Unfortunately, they are not wrapped in the
> kernel_fpu_begin()/end() calls that normally surround use of fp in the
> kernel. Native linux gets away with this because it happens so early in
> the boot process that no-one else can be using the fpu. However, on Xen
> this gets us into a bad state, which will come back to haunt us much
> later on, resulting in fpu state corruption in user processes. The fix
> in 2.0-testing is simply to 'wrap' the fdiv test.
>
> The reason the bug is not present on unstable is that the fpu code had
> already been rejigged so that we were immune to this kind of problem as
> it had been identified as a potential fragility.
>
> Since this bug hadn't been widely reported we probably won't rush to
> release a 2.0.6a demo CD, but vendor kernel maintainers should
> definitely pick up the fix.

Thanks for informing us!

I observed that the first userspace process that uses the FPU will SIGFPE once. Afterwards everything runs just fine ...

Your description looks like it matches exactly the misbehaviour I've been seeing.

Is the attached patch the right way to fix this?

--
Kurt Garloff, Director SUSE Labs, Novell Inc.
Ian Pratt
2005-Jun-08 21:19 UTC
RE: RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
> I observed that the first userspace process that uses the FPU
> will SIGFPE once. Afterwards everything runs just fine ...
>
> Your description looks like it matches exactly the
> misbehaviour I've been seeing.

Got any more critical bugs you're not telling us about? :-)

> Is attached patch the right way to fix this?

I think that should work (with the obvious kernel_ prefix), but I've appended what we've gone for.

Best,
Ian

--- linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/bugs.h 2005-06-08 22:08:52.000000000 +0100
+++ linux-2.6.11-xen0/include/asm-i386/bugs.h 2005-03-02 07:37:49.000000000 +0000
@@ -107,7 +107,6 @@
 		"fninit"
 		: "=m" (*&boot_cpu_data.fdiv_bug)
 		: "m" (*&x), "m" (*&y));
+	stts();
 	if (boot_cpu_data.fdiv_bug)
 		printk("Hmm, FPU with FDIV bug.\n");
 }
Kurt Garloff
2005-Jun-08 21:42 UTC
Re: RE: RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
Hi Ian,

On Wed, Jun 08, 2005 at 10:19:35PM +0100, Ian Pratt wrote:
>
> > I observed that the first userspace process that uses the FPU
> > will SIGFPE once. Afterwards everything runs just fine ...
> >
> > Your description looks like it matches exactly the
> > misbehaviour I've been seeing.
>
> Got any more critical bugs you're not telling us about? :-)

I wanted to hunt that one down myself ... obviously overestimating the amount of time and expertise I can devote to it :(

OK, you want another one: Well, xenified SLES9 oopses on ballooning :-) But I hope that Christian, Kip, /me will track this one down soon.

> > Is attached patch the right way to fix this?
>
> I think that should work (with the obvious kernel_ prefix), but I've
> appended what we've gone for.

Not having a CPU manual close to my desk: What do we achieve by setting bit 3 (TS) in CR0? Why does it help to get the FPU back to a sane state?

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.
Keir Fraser
2005-Jun-08 21:55 UTC
Re: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
On 8 Jun 2005, at 22:42, Kurt Garloff wrote:

>> I think that should work (with the obvious kernel_ prefix), but I've
>> appended what we've gone for.
>
> Not having a CPU manual close to my desk: What do we achieve by setting
> bit 3 (TS) in CR0? Why does it help to get the FPU back to a sane
> state?

When set it causes a fault whenever the FPU is accessed. We use it to lazily initialise the FPU for the currently running process. At context-switch time we look at the process we are descheduling and, if it hasn't used the FPU in its time slice, we don't save FPU state and we don't set the TS bit (because we assume it must be already set).

The last point is where we can fall down: if the TS bit in fact *isn't* set, then we are screwed for all time. The kernel will never realise a process is using the FPU because we will never take the TS fault, because the TS bit is clear. Thus state doesn't get saved/restored during context switch and the TS bit never gets set. So it's a self-perpetuating state once you're in it.

 -- Keir
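[Editorial sketch] The mechanism described above can be modelled in a few lines. This is a toy simulation, not kernel code: the "FPU" is a single shared value, the class and function names are invented for illustration, and the way Xen actually manages the TS bit on behalf of a guest is not modelled. It only shows why starting with TS clear leaves the system stuck: no fault is ever taken, so no task is ever marked as an FPU user, so nothing is saved and TS is never set again.

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.saved_fpu = 0.0   # per-task saved copy of the FPU state
        self.used_fpu = False  # set only by the TS fault handler


class Cpu:
    def __init__(self, ts_set):
        self.ts = ts_set       # models the TS bit in CR0
        self.fpu = 0.0         # the one shared hardware FPU register
        self.current = None
        self.faults = 0

    def _touch_fpu(self):
        # With TS set, any FPU access faults; the handler clears TS,
        # restores this task's state and records that the task uses the FPU.
        if self.ts:
            self.faults += 1
            self.ts = False
            self.fpu = self.current.saved_fpu
            self.current.used_fpu = True

    def store(self, value):
        self._touch_fpu()
        self.fpu = value

    def load(self):
        self._touch_fpu()
        return self.fpu

    def switch_to(self, task):
        prev = self.current
        if prev is not None and prev.used_fpu:
            prev.saved_fpu = self.fpu
            prev.used_fpu = False
            self.ts = True
        # Otherwise nothing is saved and TS is assumed to be set already.
        self.current = task


def run(ts_initially_set):
    """Task A stores 1.5, task B stores 2.5, then A reads its value back."""
    cpu = Cpu(ts_initially_set)
    a, b = Task("A"), Task("B")
    cpu.switch_to(a)
    cpu.store(1.5)
    cpu.switch_to(b)
    cpu.store(2.5)
    cpu.switch_to(a)
    return cpu.load(), cpu.faults


if __name__ == "__main__":
    print("TS set at start:  ", run(True))
    print("TS clear at start:", run(False))
```

With TS set at the start, task A reads back its own 1.5 after three faults. With TS clear at the start, no fault is ever taken and A reads B's 2.5, which is the kind of cross-process corruption that could turn an elapsed-time calculation into a negative number or a NaN.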
Vincent Hanquez
2005-Jun-09 08:18 UTC
Re: RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
On Wed, Jun 08, 2005 at 10:59:00PM +0200, Kurt Garloff wrote:
> -
> +
> +       fpu_begin();
>         /* Test for the divl bug.. */
>         __asm__("fninit\n\t"
>                 "fldl %1\n\t"
>                 "fdivl %2\n\t"
> @@ -108,8 +109,9 @@ static void __init check_fpu(void)
>                 : "=m" (*&boot_cpu_data.fdiv_bug)
>                 : "m" (*&x), "m" (*&y));
>         if (boot_cpu_data.fdiv_bug)
>                 printk("Hmm, FPU with FDIV bug.\n");
> +       fpu_end();
> }

This would work too, but I chose to just re-enable the TS flag, because at this point in the boot we don't even care about what is already in the fpu when entering the inline asm; we just want to take the next FPU fault.

kernel_fpu_begin/end() is more for saving the userspace FPU context when doing something that needs the FPU in kernel mode. That's probably why this test has not been wrapped in the vanilla kernel.

cheers,
--
Vincent Hanquez
Robbie Dinn
2005-Jun-10 19:52 UTC
Re: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
Hi all

Keir Fraser wrote:
>
> When set it causes a fault whenever the FPU is accessed. We use it to
> lazily initialise the FPU for the currently running process. At
> context-switch time we look at the process we are descheduling and, if
> it hasn't used the FPU in its time slice, we don't save FPU state and we
> don't set the TS bit (because we assume it must be already set).
>
> The last point is where we can fall down: if the TS bit in fact *isn't*
> set, then we are screwed for all time. The kernel will never realise a
> process is using the FPU because we will never take the TS fault,
> because the TS bit is clear. Thus state doesn't get saved/restored
> during context switch and the TS bit never gets set. So it's a
> self-perpetuating state once you're in it.

I have an end user question rather than a developer's question.

Say I have a Xen machine with several domains, some with kernels that have the FPU bug fix and some without. Can a domain with the buggy kernel upset a domain with a bug-free kernel? Or does this just affect processes within one domain?

I might want to be a bit more hasty in upgrading all the kernels if a buggy kernel/domain can upset a good kernel/domain.
Ian Pratt
2005-Jun-10 20:01 UTC
RE: [Xen-devel] wget and Zope crashes on post-2.0.6 -testing
> > The last point is where we can fall down: if the TS bit in fact
> > *isn't* set, then we are screwed for all time. The kernel will never
> > realise a process is using the FPU because we will never take the TS
> > fault, because the TS bit is clear. Thus state doesn't get
> > saved/restored during context switch and the TS bit never gets set. So
> > it's a self-perpetuating state once you're in it.
>
> Say I have a Xen machine with several domains, some with
> kernels that have the FPU bug fix and some without. Can a
> domain with the buggy kernel upset a domain with a bug-free kernel?
> Or does this just affect processes within one domain?

It just affects the one domain.

Best,
Ian

> I might want to be a bit more hasty in upgrading all the
> kernels if a buggy kernel/domain can upset a good kernel/domain.