My "disk-stress" tests have the following status: 1. SLES 9.3 using VNC as display has run for over 23 virtual hours, some 40 or so hours since I set off the test without any failures. There''s one difference between this test and previous ones: I''ve disabled the blanking of the screen - there appears to be a problem waking the screen after some time, not sure why that would be. 2. "Simple-guest" fails to restore on the second restore, ending up with the guest "killed". Scanning the xend.log, I find "error zeroing magic pages". Looking further down that path, it seems like it''s failing to do "xc_map_foreign_range"... I''m adding some debug output to try to determine where it goes wrong here. I''m going to concentrate on this to see if I can track down where the simple-guest fails first, then look at why blanking the screen makes things go wrong. I will also try with SDL once I''ve rebooted the system with a newer Xen build. -- Mats _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Hi,

At 12:58 +0200 on 25 Apr (1177505885), Petersson, Mats wrote:
> My "disk-stress" tests have the following status:
> 1. SLES 9.3 using VNC as display has run for over 23 virtual hours, some
> 40 or so hours since I set off the test without any failures.

That's great!

> There's
> one difference between this test and previous ones: I've disabled the
> blanking of the screen - there appears to be a problem waking the screen
> after some time, not sure why that would be.

What are the symptoms there? Is the guest still alive? Is qemu-dm
alive? Does it respond on the network, and just have a wedged console?
(Might it be the keyboard + mouse that have got wedged?)

> 2. "Simple-guest" fails to restore on the second restore, ending up with
> the guest "killed". Scanning the xend.log, I find "error zeroing magic
> pages". Looking further down that path, it seems like it's failing to do
> "xc_map_foreign_range"... I'm adding some debug output to try to
> determine where it goes wrong here.

Strange. Are you doing anything weird with the ioreq or xenstore pages
in the simple guest? Their PFNs should have been maintained across the
first save/restore cycle, and they were mappable the first time...

Cheers,

Tim.

--
Tim Deegan <Tim.Deegan@xensource.com>, XenSource UK Limited
Registered office c/o EC2Y 5EB, UK; company number 05334508
> -----Original Message-----
> From: Tim Deegan [mailto:Tim.Deegan@xensource.com]
> Sent: 25 April 2007 15:09
> To: Petersson, Mats
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: Re: [Xen-devel] HVM Save/Restore status.
>
> At 12:58 +0200 on 25 Apr (1177505885), Petersson, Mats wrote:
> > My "disk-stress" tests have the following status:
> > 1. SLES 9.3 using VNC as display has run for over 23 virtual hours,
> > some 40 or so hours since I set off the test without any failures.
>
> That's great!
>
> > There's one difference between this test and previous ones: I've
> > disabled the blanking of the screen - there appears to be a problem
> > waking the screen after some time, not sure why that would be.
>
> What are the symptoms there? Is the guest still alive? Is qemu-dm
> alive? Does it respond on the network, and just have a wedged console?
> (Might it be the keyboard + mouse that have got wedged?)

Good question. It turns out (from an attempt to stop the guest nicely
when I was going to reboot to have a new Linux kernel with debug code in
it) that although the guest is still running, I have actually lost at
least:

- Network. I can't ping the guest or SSH to the guest on the IP address
  it was given when it first got an address from DHCP - presumably, the
  IP address shouldn't change (it doesn't on other machines that get
  their IP address from the same DHCP server).
- Keyboard. Pressing for example CTRL-C to stop the running application
  doesn't work. No other keys appear to have any effect either.

It's unclear to me if any other operations are affected or not. [Time
seemed a bit funny too, but that may be my app - I haven't debugged that
yet. It kept cycling around a 2-3 second range around 23h14m18-20s (or
some such), where the time comes from "time()" - so perhaps there's
something wrong in the "gettimeofday" functionality too.]

> > 2. "Simple-guest" fails to restore on the second restore, ending up
> > with the guest "killed". Scanning the xend.log, I find "error zeroing
> > magic pages". Looking further down that path, it seems like it's
> > failing to do "xc_map_foreign_range"... I'm adding some debug output
> > to try to determine where it goes wrong here.
>
> Strange. Are you doing anything weird with the ioreq or xenstore pages
> in the simple guest? Their PFNs should have been maintained across the
> first save/restore cycle, and they were mappable the first time...

I'm trying to see what fails and where by printing something at every
failure point. So far I've tracked it down to somewhere inside the
function direct_remap_pfn_range... Not sure where in this function it
goes wrong or where in any of the called functions. As far as I can see,
there's not many things that can go wrong there...

--
Mats
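For context, the restore-side step failing here is the one that maps the
guest's "magic" pages (the ioreq/xenstore pages Tim mentions) into dom0
and zeroes them. Below is a minimal sketch of that step with a print at
each failure point, as described above; the helper name and message
wording are illustrative rather than the exact xc_hvm_restore code, and
it assumes the xc_map_foreign_range() signature of this libxc era.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <xenctrl.h>

#define MAGIC_PAGE_SIZE 4096   /* illustrative; libxc defines its own page size */

/* Map one guest pfn into dom0 and zero it, reporting which pfn failed
 * so xend.log shows more than just "error zeroing magic pages". */
static int zero_magic_page(int xc_handle, uint32_t dom, unsigned long pfn)
{
    void *page = xc_map_foreign_range(xc_handle, dom, MAGIC_PAGE_SIZE,
                                      PROT_READ | PROT_WRITE, pfn);
    if (page == NULL) {
        fprintf(stderr, "error zeroing magic pages: map of pfn 0x%lx failed\n",
                pfn);
        return -1;
    }
    memset(page, 0, MAGIC_PAGE_SIZE);
    munmap(page, MAGIC_PAGE_SIZE);
    return 0;
}

A NULL return from xc_map_foreign_range() at this point is presumably
what surfaces as the "error zeroing magic pages" line in xend.log.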
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Petersson, Mats
> Sent: 25 April 2007 15:25
> To: Tim Deegan
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: RE: [Xen-devel] HVM Save/Restore status.
>
[snip]
>
> > Strange. Are you doing anything weird with the ioreq or xenstore pages
> > in the simple guest? Their PFNs should have been maintained across the
> > first save/restore cycle, and they were mappable the first time...
>
> I'm trying to see what fails and where by printing something at every
> failure point. So far I've tracked it down to somewhere inside the
> function direct_remap_pfn_range... Not sure where in this function it
> goes wrong or where in any of the called functions. As far as I can see,
> there's not many things that can go wrong there...

Error is 14, which is "EFAULT", which means that the problem appears to
be inside the hypercall. I'll see if I can print the different pages
involved here.

Also, I missed answering the question of what I do with those pages:
Nothing. My guest uses about 2MB of the entire 32MB memory range, around
1MB-3MB.

--
Mats
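A small aside on the errno-14 diagnosis: errno 14 is EFAULT on Linux, so
a failure report can decode the number directly. A sketch, assuming the
failed mapping call leaves the underlying ioctl's errno intact:

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Decode errno after a failed foreign mapping so the log says
 * "Bad address" (EFAULT, 14) rather than a bare number. */
static void report_map_failure(unsigned long pfn)
{
    fprintf(stderr, "mapping pfn 0x%lx failed: errno %d (%s)\n",
            pfn, errno, strerror(errno));
}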
> -----Original Message-----
> From: Petersson, Mats
> Sent: 25 April 2007 16:18
> To: Petersson, Mats; Tim Deegan
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: RE: [Xen-devel] HVM Save/Restore status.
>
[snip]
>
> > I'm trying to see what fails and where by printing something at every
> > failure point. So far I've tracked it down to somewhere inside the
> > function direct_remap_pfn_range... Not sure where in this function it
> > goes wrong or where in any of the called functions. As far as I can
> > see, there's not many things that can go wrong there...
>
> Error is 14, which is "EFAULT", which means that the problem appears to
> be inside the hypercall.
>
> I'll see if I can print the different pages involved here.

So, a few printfs later: the first time (which succeeds) and the second
time (which fails) use exactly the same frame numbers (1fff, 1ffe, 1ffd).
It fails on the FIRST one: I split the "if( ... [0] || ... [1] || ... [2] )"
into separate lines and print the failure on each with a "[n]" to indicate
which one failed, and it got [0] in the printout.

--
Mats
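The debugging change described above looks roughly like this; the array
and helper names are placeholders, not the real restore-code identifiers:

#include <stdio.h>

/* Placeholders for illustration only. */
extern unsigned long magic_pfns[3];             /* here: 0x1fff, 0x1ffe, 0x1ffd */
extern void *map_magic_page(unsigned long pfn);

/* Instead of one OR'd test over all three mappings, test each page
 * separately so the log shows the index of the one that failed. */
static int check_magic_pages(void)
{
    int n;
    for (n = 0; n < 3; n++) {
        if (map_magic_page(magic_pfns[n]) == NULL) {
            fprintf(stderr, "error zeroing magic pages [%d]: pfn 0x%lx\n",
                    n, magic_pfns[n]);
            return -1;
        }
    }
    return 0;
}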
On 25/4/07 17:15, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:
> So, a few printfs later: the first time (which succeeds) and the second
> time (which fails) use exactly the same frame numbers (1fff, 1ffe, 1ffd).
> It fails on the FIRST one: I split the "if( ... [0] || ... [1] || ... [2] )"
> into separate lines and print the failure on each with a "[n]" to
> indicate which one failed, and it got [0] in the printout.

Mats,

Can you try adding 1 to p2m_size after the line:
    p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &dom);
in xc_domain_save.c, please. I think we have an out-by-one error that you
are triggering because your mini-domain does not drive the cirrus_vga lfb
and hence does not have any 'video ram' mapped in the RAM hole. You might
also want to print p2m_size in xc_domain_save and confirm this hypothesis
that way too.

This would also bite us for other guests with more than 4GB (we'd lose a
page per save/restore, I think). So this is a nice one to fix before 3.0.5
if I'm right!

 -- Keir
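As a sketch against the line quoted above (variable declarations and the
surrounding save logic omitted; the logging call is illustrative, not the
file's own macro), the suggested experiment is:

/* tools/libxc/xc_domain_save.c (abbreviated): XENMEM_maximum_gpfn
 * returns the highest guest pfn, so the number of p2m entries to
 * cover is that value plus one. */
p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &dom) + 1;

/* Confirm the hypothesis by logging the value actually used. */
fprintf(stderr, "p2m_size = 0x%lx\n", (unsigned long)p2m_size);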
> -----Original Message-----
> From: Keir Fraser [mailto:keir@xensource.com]
> Sent: 25 April 2007 17:51
> To: Petersson, Mats; Tim Deegan
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: Re: [Xen-devel] HVM Save/Restore status.
>
[snip]
>
> Can you try adding 1 to p2m_size after the line:
>     p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &dom);
> in xc_domain_save.c, please. I think we have an out-by-one error that
> you are triggering because your mini-domain does not drive the
> cirrus_vga lfb and hence does not have any 'video ram' mapped in the
> RAM hole. You might also want to print p2m_size in xc_domain_save and
> confirm this hypothesis that way too.

Ok, here goes:

First save:  p2m_size = 0xfffff (succeeds)
Second save: p2m_size = 0x1fff  (fails)

I'm a little bit surprised about the first number, as it's about 4GB(?)
(my domain is officially only 32MB, and uses a whole lot less actual
memory), but I guess the second number should be 0x2000 if it's the
actual size rather than the highest pfn number available in the guest.
Does that make sense to you?

[Aside from my printout, there's also a printout already in the xend.log
from xc_domain_restore ("start: p2m_size = xxxxx"), which displays the
same data as I've reported above, both before my change and after it, so
I do believe my printout isn't completely bogus.]

--
Mats
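Those numbers fit the out-by-one reading: a 32MB guest has 0x2000 4KB
frames, so its highest gpfn is 0x1fff, and using that value directly as a
count drops the last page. The arithmetic, as a standalone check (nothing
Xen-specific):

#include <stdio.h>

int main(void)
{
    unsigned long guest_bytes = 32UL << 20;          /* 32MB mini-guest */
    unsigned long frames      = guest_bytes / 4096;  /* 0x2000 frames   */
    unsigned long max_gpfn    = frames - 1;          /* 0x1fff          */

    /* XENMEM_maximum_gpfn reports the highest pfn (0x1fff here), so a
     * count of p2m entries needs max_gpfn + 1 (0x2000). */
    printf("frames = 0x%lx, max gpfn = 0x%lx, p2m_size = 0x%lx\n",
           frames, max_gpfn, max_gpfn + 1);
    return 0;
}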
On 25/4/07 18:07, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:
> First save:  p2m_size = 0xfffff (succeeds)
> Second save: p2m_size = 0x1fff  (fails)
>
> I'm a little bit surprised about the first number, as it's about 4GB(?)
> (my domain is officially only 32MB, and uses a whole lot less actual
> memory), but I guess the second number should be 0x2000 if it's the
> actual size rather than the highest pfn number available in the guest.
> Does that make sense to you?

This seems to confirm my suspicion. Can you try doing +1 on the p2m_size
now? It should fix your problem. If so I'll do a full fix tomorrow.

Thanks,
Keir
> -----Original Message-----
> From: Woller, Thomas
> Sent: 26 April 2007 20:43
> To: Petersson, Mats
> Subject: FW: [Xen-devel] HVM Save/Restore status.
>
> How did this turn out? Nice work btw!

[snip]

> This seems to confirm my suspicion. Can you try doing +1 on the
> p2m_size now? It should fix your problem. If so I'll do a full fix
> tomorrow.

I was out on holiday on Thursday - sorry, I forgot to say that in my last
e-mail on Wednesday.

I tried that quickly this morning, then the official fix, both of which
work just fine. The simple-guest still hangs after some hundreds of
save/restore cycles (which doesn't happen when it's just running
normally). Not sure what causes this. I would also like to figure out why
the Linux guest loses keyboard/network communications after some time.

But this is certainly an improvement - thanks Keir for the idea + fix.

--
Mats