thr3ads.net - Xen devel - [Xen-devel] time still going backwards [Oct 2004]

Tim Freeman

2004-Oct-13 00:09 UTC

[Xen-devel] time still going backwards

I've been running Pallas MPI benchmarks with several configurations, and
I just ran a test that errored out.  I've run the benchmark successfully
on Xen0 only (four nodes) and on XenU only (four nodes) with no
Xen-related errors and no benchmark errors.

This time I ran it with two XenU's on each of the four nodes, each
participating in one of two separate, simultaneous benchmark runs (two
groups of four XenU's), all bridged to the cluster LAN.  Only one physical
node had a problem (all nodes are identical builds of Xen and XenLinux
2.4.27, last cset 1.1362, 2004-10-04 15:55:47+01:00).  There was a group
of messages in late August about the same "time went backwards" errors,
but this is a recent build.

Another detail: on this node Xen chose to host both guests on CPU 1.  I
also know that at the exact moment of failure Xen1 was interacting with
the only other node that did not spread out its guests (that one actually
had all three of Xen0, Xen1 and Xen2 on CPU 0).

I have no clue if any of this information is helpful :-).  

(I am attempting another run with the same configuration right now)



xm dmesg:

(XEN) APIC error on CPU0: 00(02)
(XEN) APIC error on CPU1: 00(02)
(XEN) APIC error on CPU1: 02(02)
(XEN) APIC error on CPU0: 02(02)
(XEN) APIC error on CPU1: 02(01)
(XEN) APIC error on CPU0: 02(02)

Xen0 dmesg, just two error messages:
Timer ISR: Time went backwards: -59799000
Timer ISR: Time went backwards: -48699000

(these filled the whole kernel ring buffer:)
Xen1 dmesg, attached, time went backwards many times
Xen2 dmesg, attached, time went backwards many times 

benchmark error, Xen1, presumably at the same time as Xen2 (though on
a different benchmark: the two groups of four actually lost sync after a
while.  I'm using the default CPU scheduler.  I chalk the lost sync up to
the weird CPU pinning that Xen/Xend chose for two of the physical nodes; I
am going to pin those myself in the future)

p3_827:  p4_error: net_recv read:  probable EOF on socket: 1
p1_777:  p4_error: net_recv read:  probable EOF on socket: 1

benchmark error, Xen2
p2_821: (347.806618) net_recv failed for fd = 4
p2_821:  p4_error: net_recv read, errno = : 104
p3_769:  p4_error: net_recv read:  probable EOF on socket: 1
p1_766: (402.558327) net_recv failed for fd = 8
p1_766:  p4_error: net_recv read, errno = : 104

Ian Pratt

2004-Oct-13 02:18 UTC

Re: [Xen-devel] time still going backwards

> Another detail: on this node Xen chose to host both guests on CPU 1.  I
> also know that at the exact moment of failure Xen1 was interacting with
> the only other node that did not spread out its guests (that one actually
> had all three of Xen0, Xen1 and Xen2 on CPU 0).
The code that chooses the initial CPU for a domain is an
embarrassment, and currently makes no attempt to distribute them
evenly. I'll check in something that at least chooses the CPU
with the smallest number of domains. Proper load balancing will
require someone to write the simple little daemon discussed
earlier this week on the list.
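
The "CPU with the smallest number of domains" policy could look roughly
like the sketch below. This is only an illustration and not the code that
was checked in: the function name pick_least_loaded_cpu and the
domains_per_cpu array are hypothetical, and ties are assumed to go to the
lowest-numbered CPU.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the CPU that currently hosts
 * the fewest domains. domains_per_cpu[i] is assumed to hold the number
 * of domains already placed on CPU i. Ties go to the lowest CPU index.
 */
static size_t pick_least_loaded_cpu(const unsigned int *domains_per_cpu,
                                    size_t ncpus)
{
    assert(domains_per_cpu != NULL && ncpus > 0);

    size_t best = 0;
    for (size_t cpu = 1; cpu < ncpus; cpu++) {
        /* Strict comparison keeps the earliest CPU when counts are equal. */
        if (domains_per_cpu[cpu] < domains_per_cpu[best])
            best = cpu;
    }
    return best;
}
```

With the placement reported above (all three domains on CPU 0 of a
two-CPU node), such a helper would pick CPU 1 for the next domain.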
> xm dmesg:
> 
> (XEN) APIC error on CPU0: 00(02)
> (XEN) APIC error on CPU1: 00(02)
Odd. Probably not terminal, though.
> Xen0 dmesg, just two error messages:
> Timer ISR: Time went backwards: -59799000
> Timer ISR: Time went backwards: -48699000
Interesting. So both the xenU domains are reporting a 14s
skip, and dom0 is reporting a larger skip (though this may be a
different incident).

Are you running ntpdate or xntpd in domain0? What about the other
domains? (I presume that you haven't requested
independent_wallclock for them?)

It might be interesting to modify the printk in
arch/xen/i386/kernel/time.c to also print the variables that go
into the delta calculation e.g.:

printk("Timer ISR: Time went backwards: %lld %lld %ld %lld\n", delta,
shadow_system_time, (cur_timer->get_offset() * NSEC_PER_USEC), 
processed_system_time);

I presume it's shadow_system_time that's jumping, but it would be
useful if you could add debugging to prove this.
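
For readers following along, here is a minimal user-space sketch of how
those variables plausibly combine. It assumes the delta is
shadow_system_time plus the timer offset (converted from microseconds to
nanoseconds) minus processed_system_time, which is what the printk
arguments above suggest. The function names are hypothetical and this is
not the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds per microsecond, same meaning as the kernel constant. */
#define NSEC_PER_USEC 1000LL

/*
 * Hypothetical stand-in for the delta computed in the timer ISR.
 * Assumption: delta = shadow_system_time + offset - processed_system_time,
 * where the offset is supplied in microseconds and all times are in ns.
 */
static int64_t timer_delta_ns(int64_t shadow_system_time,
                              int64_t offset_usec,
                              int64_t processed_system_time)
{
    assert(offset_usec >= 0); /* an elapsed-time offset is never negative */
    return shadow_system_time + offset_usec * NSEC_PER_USEC
           - processed_system_time;
}

/* Returns 1 when the ISR would report "Time went backwards". */
static int time_went_backwards(int64_t delta_ns)
{
    return delta_ns < 0;
}
```

Under this assumption, a negative delta means shadow_system_time plus the
offset has fallen behind the time the guest has already processed.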

Ian


-------------------------------------------------------
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
Use IT products in your business? Tell us what you think of them. Give us
Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel

Keir Fraser

2004-Oct-13 07:25 UTC

Re: [Xen-devel] time still going backwards

If you're getting APIC errors then all other bets are off, quite
frankly. Do you get any of these messages on a native Linux 2.4
kernel?

 -- Keir

Tim Freeman

2004-Oct-13 20:00 UTC

Re: [Xen-devel] time still going backwards

On Wed, 13 Oct 2004 08:25:36 +0100
Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> 
> If you're getting APIC errors then all other bets are off, quite
> frankly. Do you get any of these messages on a native Linux 2.4
> kernel?
I've been running the benchmark just now on native Linux, no errors,
APIC or otherwise.  I had also originally run the benchmark on native
Linux with the 'nosmp' flag (to compare more readily to Xen0), and there
were no errors there either.

I re-ran the benchmark last night with the 8 XenU/4 physical node
configuration and reproduced the problem on both the XenU's on the same
physical node (it is the one the jobs are started from), *although*
xen0 did not exhibit the time jump.  To answer Ian's question, I am
running ntpd on each of the xen0's; perhaps that was the problem with
domain 0 jumping (but I thought ntpd only makes minuscule steps, and
ntpdate is the one that makes bigger jumps).

Then I recompiled with Ian's suggested debug patch.  It actually went
smoothly, with no time errors, but still APIC errors in xm dmesg
(definitely new, since I had rebooted).

???

I am not really a super systems person (is that obvious yet? :-), but
don't APIC errors have to do with SMP a lot of the time?  And what does
the APIC have to do with XenU timing issues? (Does one of the new, extra
IRQs go to XenU directly?)  Can I boot Xen with the noapic flag?

Thanks for any input. I am going to continue running different
configurations (as I originally planned) and I'll see if anything else
happens.

Brian Wolfe

2004-Oct-14 03:13 UTC

Re: [Xen-devel] time still going backwards

On Wed, 2004-10-13 at 15:00, Tim Freeman wrote:
> On Wed, 13 Oct 2004 08:25:36 +0100
> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> 
> > 
> > If you're getting APIC errors then all other bets are off, quite
> > frankly. Do you get any of these messages on a native Linux 2.4
> > kernel?
> 
> I've been running the benchmark just now on native Linux, no errors,
> APIC or otherwise.  I also had originally run the benchmark on native
> linux with the 'nosmp' flag (to compare more readily to Xen0) and also
> no errors there.
> 
> I re-ran the benchmark last night with the 8 xenU/4 physical
> configuration and repeated the problem on both the XenU's on the same
> physical node (it is the one the jobs are started from).  *Although*
> xen0 did not exhibit the time jump.  To answer Ian's question, I am
> running ntpd on each of the xen0's, perhaps that was the problem with
> domain 0 jumping (but I thought ntpd only makes minuscule steps, but
> ntpdate is the one that makes bigger jumps)
ntp can make large corrections; however, the standard (default) setup is
to make ~120ms changes per "tick" on checking. Once ntpd is synced up it
shouldn't need to make more than ~2ms of change per tick unless your
hardware clock is totally farked (I'd call 1ms in 60 seconds beyond
farked to be honest). :)
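
To put the "1ms in 60 seconds" figure into the parts-per-million units
usually used for clock drift, here is a minimal sketch. The helper name
drift_ppm is hypothetical; it only does the unit conversion and says
nothing about how ntpd itself behaves.

```c
#include <assert.h>

/*
 * Hypothetical helper: express a clock error accumulated over an
 * interval as a drift rate in parts per million (ppm).
 * Both arguments are in seconds.
 */
static double drift_ppm(double drift_seconds, double interval_seconds)
{
    assert(interval_seconds > 0.0);
    return drift_seconds / interval_seconds * 1e6;
}
```

Calling drift_ppm(0.001, 60.0) gives roughly 16.7 ppm for the example
above.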
> 
> Then I recompiled with Ian's suggested debug patch.  It went smoothly
> actually, no time errors, but still APIC errors in xm dmesg (definitely
> new, since I had rebooted).
> 
> ???
> 
> I am not really a super systems person (is that obvious yet? :-), but
> don't APIC errors have to do with SMP a lot of the time?  And what does
> APIC have to do with XenU timing issues? (does one of the new, extra
> IRQs go to XenU directly?)  Can I boot xen with the noapic flag?
From what I understand, APIC was originally implemented to allow SMP,
and as such was only used for SMP. Some time around the middle of the
2.4 series I noticed that you could select UP-APIC when SMP was
disabled. I'm not certain exactly when that ability was enabled, though.
> 
> Thanks for any input, I am going to continue running different
> configurations (as I originally planned) and I'll see if anything else
> happens.

Tim Freeman

2004-Oct-17 02:31 UTC

Re: [Xen-devel] [FIXED] time still going backwards

The time going backwards is no longer a problem.  I believe it was
particular benchmarks that required too many big messages, which maxed
out the RAM.  The masters had no swap space and bombed on that; with
swap added they no longer do.  I feel dumb that it took five runs before
I tried adding swap, as the EOF is a big clue.

I posted the PMB MPI results here (no interpretation):
http://www-unix.mcs.anl.gov/~tfreeman/envelope/index.html

This is only a preliminary "back of the envelope" set of runs that I
only had time to think about off and on this week.  This isn't part of
my globus-VM project; it was just a "see what happens" experiment since
Xen was installed anyhow.  I have no time left for it, but there they
are, whatever they're worth.

I do have some thoughts on the runs in Part I.  In general, the raw and
Xen results converged as the message size got bigger.  This is most
likely due to the fact that domain0 needs to bridge guests' packets.  As
the messages got bigger, more could be moved at once, so the transfer out
of the box was faster per byte.  Right?



btw, for reference, before trying the swap idea, I tried moving the
masters to a new physical node. It bombed, and the XenU's reported a new
error:

__alloc_pages: 0-order allocation failed (gfp=0x1d2/0) 
VM: killing process PMB-MPI1


And in that run, time only went backwards on Xen0.

I didn't add the struct to the debug; it wouldn't compile and I was in too
much of a rush to figure out the intended extra arg, sorry.

printk("Timer ISR: Time went backwards: %lld %lld %lld\n", delta,
shadow_system_time, processed_system_time);

Xen0, just two again:
Timer ISR: Time went backwards: -59842000 7226230000000 7226290000000
Timer ISR: Time went backwards: -49988000 7226240000000 7226290000000
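
A rough reading of the two lines above, as a minimal sketch. It assumes
the ISR forms delta as shadow_system_time plus the timer offset minus
processed_system_time (the terms named in Ian's suggested printk); the
helper names are hypothetical. Under that assumption the printed values
imply that shadow_system_time was 60 ms and 50 ms behind
processed_system_time, with timer offsets of 158 us and 12 us.

```c
#include <assert.h>
#include <stdint.h>

/* How far shadow_system_time is ahead of processed_system_time, in ns.
 * A negative result means the shadow time is behind. */
static int64_t shadow_lag_ns(int64_t shadow, int64_t processed)
{
    return shadow - processed;
}

/*
 * Timer offset implied by a printed line, in ns.
 * Assumption: delta = shadow + offset - processed, so
 * offset = delta - (shadow - processed).
 */
static int64_t implied_offset_ns(int64_t delta, int64_t shadow,
                                 int64_t processed)
{
    int64_t offset = delta - shadow_lag_ns(shadow, processed);
    assert(offset >= 0); /* a negative offset would contradict the assumption */
    return offset;
}
```

For the first line, implied_offset_ns(-59842000LL, 7226230000000LL,
7226290000000LL) works out to 158000 ns; for the second,
implied_offset_ns(-49988000LL, 7226240000000LL, 7226290000000LL) works
out to 12000 ns.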



Thanks for the help before, I really appreciate it.  The failure was my
fault, but these errors are still strange, aren't they?











Ian Pratt

2004-Oct-17 09:30 UTC

Re: [Xen-devel] [FIXED] time still going backwards

> btw, for reference, before trying the swap idea, I tried moving the
> masters to a new physical node, it bombed, and the XenU's reported a new
> error:
> 
> __alloc_pages: 0-order allocation failed (gfp=0x1d2/0) 
> VM: killing process PMB-MPI1
This is just the guest kernel running out of memory, and the
out-of-memory killer selecting a victim. Are you sure this VM had
swap configured?
> printk("Timer ISR: Time went backwards: %lld %lld %lld\n", delta,
> shadow_system_time, processed_system_time);
> 
> Xen0, just two again:
> Timer ISR: Time went backwards: -59842000 7226230000000 7226290000000
> Timer ISR: Time went backwards: -49988000 7226240000000 7226290000000
That's useful, thanks. I'll take a look at the code.

Ian




Tim Freeman

2004-Oct-17 15:32 UTC

Re: [Xen-devel] [FIXED] time still going backwards

On Sun, 17 Oct 2004 10:30:12 +0100
Ian Pratt <Ian.Pratt@cl.cam.ac.uk> wrote:
> 
> > btw, for reference, before trying the swap idea, I tried moving the
> > masters to a new physical node, it bombed, and the XenU's reported a new
> > error:
> > 
> > __alloc_pages: 0-order allocation failed (gfp=0x1d2/0) 
> > VM: killing process PMB-MPI1
> 
> This is just the guest kernel running out of memory, and the
> out-of-memory killer selecting a victim. Are you sure this VM had
> swap configured?
Swap is not configured for the guest when this happens; I was just
reporting the error along the way in case it was useful.  When swap is
configured there are no problems.
