I've been running the Pallas MPI benchmarks with several configurations, and I just ran a test that errored out.

I've run the benchmark successfully on Xen0 only (four nodes) and on XenU only (four nodes), with no Xen-related errors and no benchmark errors. This time I ran it with two XenUs on each of the four nodes, split across two separate, simultaneous benchmark runs (two groups of four XenUs) and all bridged to the cluster LAN.

Only one physical node had a problem (they are identical builds of Xen and XenLinux 2.4.27, last cset 1.1362, 2004-10-04 15:55:47+01:00). There was a group of messages in late August with the same "time went backwards" errors, but this is a recent build.

One other thing is that on this node Xen chose to host both guests on CPU 1 (and I know that at the exact moment of failure Xen1 was interacting with the only other node that did not spread out its guests; it actually had all three of Xen0, Xen1 and Xen2 on CPU 0).

I have no clue if any of this information is helpful :-). (I am attempting another run with the same configuration right now.)

xm dmesg:

(XEN) APIC error on CPU0: 00(02)
(XEN) APIC error on CPU1: 00(02)
(XEN) APIC error on CPU1: 02(02)
(XEN) APIC error on CPU0: 02(02)
(XEN) APIC error on CPU1: 02(01)
(XEN) APIC error on CPU0: 02(02)

Xen0 dmesg, just two error messages:

Timer ISR: Time went backwards: -59799000
Timer ISR: Time went backwards: -48699000

(these filled the whole kernel ring buffer:)
Xen1 dmesg, attached, time went backwards many times
Xen2 dmesg, attached, time went backwards many times

benchmark error, Xen1, presumably at the same time as Xen2 (though on a different benchmark: the two groups of four actually lost sync after a while. I'm using the default CPU scheduler. I chalk that up to the weird CPU pinning that Xen/Xend chose for two of the physical nodes; I am going to pin those myself in the future):

p3_827: p4_error: net_recv read: probable EOF on socket: 1
p1_777: p4_error: net_recv read: probable EOF on socket: 1

benchmark error, Xen2:

p2_821: (347.806618) net_recv failed for fd = 4
p2_821: p4_error: net_recv read, errno = : 104
p3_769: p4_error: net_recv read: probable EOF on socket: 1
p1_766: (402.558327) net_recv failed for fd = 8
p1_766: p4_error: net_recv read, errno = : 104
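A note for readers decoding the p4 messages above: on Linux/x86, errno 104 is ECONNRESET ("Connection reset by peer"), which suggests the remote MPI process went away rather than the local receiver misbehaving. The sketch below shows one way a receive loop might classify such failures. The names `recv_status` and `classify_recv_failure` are hypothetical and are not part of the p4 library; this is an illustration under those assumptions, not how p4 itself is written.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical result codes for a read()/recv() call on a socket. */
enum recv_status {
    RECV_OK,          /* data was received */
    RECV_EOF,         /* peer closed the connection in an orderly way */
    RECV_PEER_RESET,  /* ECONNRESET: errno 104 on Linux/x86 */
    RECV_RETRY,       /* transient condition, the call can be retried */
    RECV_ERROR        /* anything else */
};

/* n is the return value of read()/recv(); err is errno captured
 * immediately after the call. */
enum recv_status classify_recv_failure(ssize_t n, int err)
{
    if (n > 0)
        return RECV_OK;
    if (n == 0)
        return RECV_EOF;
    if (err == ECONNRESET)
        return RECV_PEER_RESET;
    if (err == EINTR || err == EAGAIN)
        return RECV_RETRY;
    return RECV_ERROR;
}
```

Under this reading, the "errno = : 104" lines fall into the RECV_PEER_RESET case, which fits a peer process having been killed on another node.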
> One thing is also that on this node Xen chose to host both guests on CPU
> 1 (and I know that at the exact moment of failure Xen1 was interacting
> with the only other one not to spread out the guests (it actually had
> all three Xen0,Xen1,Xen2 on CPU 0)).

The code that chooses the initial CPU for a domain is an embarrassment, and currently makes no attempt to distribute them evenly. I'll check in something that at least chooses the CPU with the smallest number of domains. Proper load balancing will require someone to write the simple little daemon discussed earlier this week on the list.

> xm dmesg:
>
> (XEN) APIC error on CPU0: 00(02)
> (XEN) APIC error on CPU1: 00(02)

Odd. Probably not terminal, though.

> Xen0 dmesg, just two error messages:
> Timer ISR: Time went backwards: -59799000
> Timer ISR: Time went backwards: -48699000

Interesting. So both the xenU domains are reporting a 14s skip, and dom0 is reporting a larger skip (though this may be a different incident).

Are you running ntpdate or xntpd in domain0? What about the other domains? (I presume that you haven't requested independent_wallclock for them?)

It might be interesting to modify the printk in arch/xen/i386/kernel/time.c to also print the variables that go into the delta calculation, e.g.:

    printk("Timer ISR: Time went backwards: %lld %lld %ld %lld\n",
           delta, shadow_system_time,
           (cur_timer->get_offset() * NSEC_PER_USEC),
           processed_system_time);

I presume it's shadow_system_time that's jumping, but it would be useful if you could add debugging to prove this.

Ian

-------------------------------------------------------
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
Use IT products in your business? Tell us what you think of them. Give us
Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
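The interim fix Ian describes, choosing the CPU with the smallest number of domains, can be sketched in a few lines of C. This is only an illustration of the idea and not Xen's actual code: the function name `pick_least_loaded_cpu`, the per-CPU count array, and the tie-breaking rule (lowest CPU number wins) are all assumptions made for the example.

```c
#include <stddef.h>

/* Return the index of the CPU currently hosting the fewest domains.
 * Ties go to the lowest-numbered CPU. Returns -1 if there are no CPUs.
 * Hypothetical helper for illustration; not taken from the Xen tree. */
int pick_least_loaded_cpu(const unsigned int *domains_per_cpu, size_t ncpus)
{
    if (domains_per_cpu == NULL || ncpus == 0)
        return -1;

    size_t best = 0;
    for (size_t cpu = 1; cpu < ncpus; cpu++) {
        if (domains_per_cpu[cpu] < domains_per_cpu[best])
            best = cpu;
    }
    return (int)best;
}
```

With the placement reported in the thread for one node (three domains on CPU 0, none on CPU 1), this would pick CPU 1 for the next domain. It counts domains only and ignores how busy each one is, which is why proper load balancing would still need the daemon Ian mentions.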
If you're getting APIC errors then all other bets are off, quite frankly. Do you get any of these messages on a native Linux 2.4 kernel?

 -- Keir

> > xm dmesg:
> >
> > (XEN) APIC error on CPU0: 00(02)
> > (XEN) APIC error on CPU1: 00(02)
>
> Odd. Probably not terminal, though.
>
> > Xen0 dmesg, just two error messages:
> > Timer ISR: Time went backwards: -59799000
> > Timer ISR: Time went backwards: -48699000
>
> Interesting. So both the xenU domains are reporting a 14s
> skip, and dom0 is reporting a larger skip (though this may be a
> different incident).
On Wed, 13 Oct 2004 08:25:36 +0100
Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:

> If you're getting APIC errors then all other bets are off, quite
> frankly. Do you get any of these messages on a native Linux 2.4
> kernel?

I've been running the benchmark just now on native Linux: no errors, APIC or otherwise. I had also originally run the benchmark on native Linux with the 'nosmp' flag (to compare more readily to Xen0), and there were no errors there either.

I re-ran the benchmark last night with the 8 XenU / 4 physical node configuration and reproduced the problem on both of the XenUs on the same physical node (it is the one the jobs are started from). *Although* xen0 did not exhibit the time jump this time. To answer Ian's question, I am running ntpd on each of the xen0s; perhaps that was the cause of domain 0 jumping (but I thought ntpd only makes minuscule adjustments, and ntpdate is the one that makes bigger jumps).

Then I recompiled with Ian's suggested debug patch. It actually went smoothly, with no time errors, but there were still APIC errors in xm dmesg (definitely new, since I had rebooted).

???

I am not really a super systems person (is that obvious yet? :-), but don't APIC errors have to do with SMP a lot of the time? And what does the APIC have to do with XenU timing issues? (Does one of the new, extra IRQs go to XenU directly?) Can I boot Xen with the noapic flag?

Thanks for any input. I am going to continue running different configurations (as I originally planned) and I'll see if anything else happens.
On Wed, 2004-10-13 at 15:00, Tim Freeman wrote:
> On Wed, 13 Oct 2004 08:25:36 +0100
> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
>
> > If you're getting APIC errors then all other bets are off, quite
> > frankly. Do you get any of these messages on a native Linux 2.4
> > kernel?
>
> I've been running the benchmark just now on native Linux, no errors,
> APIC or otherwise. I also had originally run the benchmark on native
> linux with the 'nosmp' flag (to compare more readily to Xen0) and also
> no errors there.
>
> I re-ran the benchmark last night with the 8 xenU/4 physical
> configuration and repeated the problem on both the XenU's on the same
> physical node (it is the one the jobs are started from). *Although*
> xen0 did not exhibit the time jump. To answer Ian's question, I am
> running ntpd on each of the xen0's, perhaps that was the problem with
> domain 0 jumping (but I thought ntpd only makes minuscule steps, but
> ntpdate is the one that makes bigger jumps)

ntp can make large corrections; however, the standard (default) setup is to make ~120ms changes per "tick" on checking. Once ntpd is synced up it shouldn't need to make more than ~2ms of change per tick unless your hardware clock is totally farked (I'd call 1ms in 60 seconds beyond farked, to be honest). :)

> Then I recompiled with Ian's suggested debug patch. It went smoothly
> actually, no time errors, but still APIC errors in xm dmesg (definitely
> new, since I had rebooted).
>
> ???
>
> I am not really a super systems person (is that obvious yet? :-), but
> don't APIC errors have to do with SMP a lot of the time? And what does
> APIC have to do with XenU timing issues? (does one of the new, extra
> IRQs go to XenU directly?) Can I boot xen with the noapic flag?

From what I understand, APIC was originally implemented to allow SMP, and as such was only used for SMP. Some time around the middle of the 2.4 series I noticed that you could select UP-APIC when SMP was disabled.
Not certain exactly when that ability was enabled though.
The time going backwards is no longer a problem. I believe it was particular benchmarks that required too many big messages, which maxed out the RAM. The masters had no swap space and bombed on that; now that they have swap, they don't. I feel dumb that it took five runs before I tried adding swap, as the EOF is a big clue.

I posted the PMB MPI results here (no interpretation):

http://www-unix.mcs.anl.gov/~tfreeman/envelope/index.html

This is only a preliminary "back of the envelope" set of runs that I only had time to think about off and on this week. This isn't part of my globus-VM project; it was just a "see what happens" experiment since Xen was installed anyhow. I have no time left for it, but there they are, whatever they're worth.

I do have some thoughts on the runs in Part I. In general, the raw and Xen results converged as the message size got bigger. This is most likely due to the fact that domain0 needs to bridge the guests' packets. As the messages got bigger, more data could be moved at once, so the per-byte cost of getting out of the box dropped. Right?

btw, for reference, before trying the swap idea, I tried moving the masters to a new physical node. It bombed, and the XenUs reported a new error:

__alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
VM: killing process PMB-MPI1

And on that run, the time only went backwards on Xen0. I didn't add the struct to the debug; it wouldn't compile and I was in too much of a rush to figure out the intended extra arg, sorry.

    printk("Timer ISR: Time went backwards: %lld %lld %lld\n", delta,
           shadow_system_time, processed_system_time);

Xen0, just two again:

Timer ISR: Time went backwards: -59842000 7226230000000 7226290000000
Timer ISR: Time went backwards: -49988000 7226240000000 7226290000000

Thanks for the help before! I really appreciate it, and the failure was my fault here, but these errors are still strange, aren't they?
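The two Xen0 lines above can be checked by hand. The sketch below assumes the ISR computes delta as shadow_system_time plus the timer offset (converted to nanoseconds) minus processed_system_time. That relation is what Ian's suggested printk implies, but the thread does not show the kernel source, so treat it as an assumption. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumed relation (not confirmed by the thread), all values in ns:
 *   delta = shadow_system_time + offset_ns - processed_system_time */
int64_t timer_delta_ns(int64_t shadow_system_time, int64_t offset_ns,
                       int64_t processed_system_time)
{
    return shadow_system_time + offset_ns - processed_system_time;
}

/* Solve the same relation for the offset term, which was left out of
 * the three-argument printk. */
int64_t implied_offset_ns(int64_t delta, int64_t shadow_system_time,
                          int64_t processed_system_time)
{
    return delta - (shadow_system_time - processed_system_time);
}
```

If the assumed relation holds, the first line has shadow_system_time 60,000,000 ns behind processed_system_time with an implied offset of 158,000 ns, and the second has it 50,000,000 ns behind with an implied offset of 12,000 ns. In both cases nearly all of the negative delta comes from shadow_system_time lagging, which is consistent with Ian's guess about which variable is jumping.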
> btw, for reference, before trying the swap idea, I tried moving the
> masters to a new physical node, it bombed, and the XenU's reported a new
> error:
>
> __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> VM: killing process PMB-MPI1

This is just the guest kernel running out of memory, and the out-of-memory killer selecting a victim. Are you sure this VM had swap configured?

> printk("Timer ISR: Time went backwards: %lld %lld %lld\n", delta,
> shadow_system_time, processed_system_time);
>
> Xen0, just two again:
> Timer ISR: Time went backwards: -59842000 7226230000000 7226290000000
> Timer ISR: Time went backwards: -49988000 7226240000000 7226290000000

That's useful, thanks. I'll take a look at the code.

Ian
On Sun, 17 Oct 2004 10:30:12 +0100
Ian Pratt <Ian.Pratt@cl.cam.ac.uk> wrote:

> > btw, for reference, before trying the swap idea, I tried moving the
> > masters to a new physical node, it bombed, and the XenU's reported a new
> > error:
> >
> > __alloc_pages: 0-order allocation failed (gfp=0x1d2/0)
> > VM: killing process PMB-MPI1
>
> This is just the guest kernel running out of memory, and the
> out-of-memory killer selecting a victim. Are you sure this VM had
> swap configured?

Swap was not configured for the guest when this happened; I was just reporting the error along the way in case it was useful. When swap is configured there are no problems.
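For anyone wanting to confirm whether a guest has swap before a run, one option on Linux is to look at the SwapTotal field of /proc/meminfo. The sketch below parses that field out of text the caller has already read from the file. The helper name `swap_total_kb` is hypothetical, and the exact layout of /proc/meminfo on a given kernel is assumed to include a "SwapTotal: N kB" line.

```c
#include <stdio.h>
#include <string.h>

/* Parse the "SwapTotal:" value (in kB) from /proc/meminfo-style text.
 * Returns -1 if the field is not present. A result of 0 means the
 * system has no swap configured. */
long swap_total_kb(const char *meminfo_text)
{
    const char *line = meminfo_text;

    while (line != NULL && *line != '\0') {
        long kb;
        /* sscanf stops at the first literal mismatch, so lines such as
         * "SwapCached:" or "MemTotal:" simply fail to match. */
        if (sscanf(line, "SwapTotal: %ld kB", &kb) == 1)
            return kb;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline */
    }
    return -1;
}
```

A guest reporting 0 here would be expected to hit the out-of-memory killer as soon as the benchmark's working set exceeds RAM, which matches the failure described above.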