My "disk-stress" tests have the following status: 1. SLES 9.3 using VNC as display has run for over 23 virtual hours, some 40 or so hours since I set off the test without any failures. There''s one difference between this test and previous ones: I''ve disabled the blanking of the screen - there appears to be a problem waking the screen after some time, not sure why that would be. 2. "Simple-guest" fails to restore on the second restore, ending up with the guest "killed". Scanning the xend.log, I find "error zeroing magic pages". Looking further down that path, it seems like it''s failing to do "xc_map_foreign_range"... I''m adding some debug output to try to determine where it goes wrong here. I''m going to concentrate on this to see if I can track down where the simple-guest fails first, then look at why blanking the screen makes things go wrong. I will also try with SDL once I''ve rebooted the system with a newer Xen build. -- Mats _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Hi,

At 12:58 +0200 on 25 Apr (1177505885), Petersson, Mats wrote:
> My "disk-stress" tests have the following status:
> 1. SLES 9.3 using VNC as display has run for over 23 virtual hours, some
> 40 or so hours since I set off the test without any failures.

That's great!

> There's
> one difference between this test and previous ones: I've disabled the
> blanking of the screen - there appears to be a problem waking the screen
> after some time, not sure why that would be.

What are the symptoms there? Is the guest still alive? Is qemu-dm
alive? Does it respond on the network, and just have a wedged console?
(Might it be the keyboard + mouse that have got wedged?)

> 2. "Simple-guest" fails to restore on the second restore, ending up with
> the guest "killed". Scanning the xend.log, I find "error zeroing magic
> pages". Looking further down that path, it seems like it's failing to do
> "xc_map_foreign_range"... I'm adding some debug output to try to
> determine where it goes wrong here.

Strange. Are you doing anything weird with the ioreq or xenstore pages
in the simple guest? Their PFNs should have been maintained across the
first save/restore cycle, and they were mappable the first time...

Cheers,

Tim.

--
Tim Deegan <Tim.Deegan@xensource.com>, XenSource UK Limited
Registered office c/o EC2Y 5EB, UK; company number 05334508
> -----Original Message-----
> From: Tim Deegan [mailto:Tim.Deegan@xensource.com]
> Sent: 25 April 2007 15:09
> To: Petersson, Mats
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: Re: [Xen-devel] HVM Save/Restore status.
>
> At 12:58 +0200 on 25 Apr (1177505885), Petersson, Mats wrote:
> > My "disk-stress" tests have the following status:
> > 1. SLES 9.3 using VNC as display has run for over 23 virtual hours,
> > some 40 or so hours since I set off the test without any failures.
>
> That's great!
>
> > There's one difference between this test and previous ones: I've
> > disabled the blanking of the screen - there appears to be a problem
> > waking the screen after some time, not sure why that would be.
>
> What are the symptoms there? Is the guest still alive? Is qemu-dm
> alive? Does it respond on the network, and just have a wedged console?
> (Might it be the keyboard + mouse that have got wedged?)

Good question. It turns out (from an attempt to stop the guest nicely
when I was going to reboot to have a new Linux kernel with debug code in
it) that although the guest is still running, I have actually lost at
least:

- Network. I can't ping the guest or SSH to the guest on the IP address
  it was given when it first got an address from DHCP - presumably, the
  IP address shouldn't change (it doesn't on other machines that get
  their IP address from the same DHCP server).
- Keyboard. Pressing for example CTRL-C to stop the running application
  doesn't work. No other keys appear to have any effect either.

It's unclear to me if any other operations are affected or not. [Time
seemed a bit funny too, but that may be my app - I haven't debugged that
yet. It kept cycling around a 2-3 second range around 23h14m18-20s (or
some such), where the time comes from "time()" - so perhaps there's
something wrong in the "gettimeofday" functionality too.]

> > 2. "Simple-guest" fails to restore on the second restore, ending up
> > with the guest "killed". Scanning the xend.log, I find "error zeroing
> > magic pages". Looking further down that path, it seems like it's
> > failing to do "xc_map_foreign_range"... I'm adding some debug output
> > to try to determine where it goes wrong here.
>
> Strange. Are you doing anything weird with the ioreq or xenstore pages
> in the simple guest? Their PFNs should have been maintained across the
> first save/restore cycle, and they were mappable the first time...

I'm trying to see what fails and where by printing something at every
failure point. So far I've tracked it down to somewhere inside the
function direct_remap_pfn_range... Not sure where in this function it
goes wrong or where in any of the called functions. As far as I can see,
there's not many things that can go wrong there...

--
Mats
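For context, the restore-side step failing here is the one that maps the
guest's "magic" pages (the ioreq/xenstore pages Tim mentions) into dom0
and zeroes them. Below is a minimal sketch of that step with a print at
each failure point, as described above; the helper name and message
wording are illustrative rather than the exact xc_hvm_restore code, and
it assumes the xc_map_foreign_range() signature of this libxc era.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <xenctrl.h>

#define MAGIC_PAGE_SIZE 4096   /* illustrative; libxc defines its own page size */

/* Map one guest pfn into dom0 and zero it, reporting which pfn failed
 * so xend.log shows more than just "error zeroing magic pages". */
static int zero_magic_page(int xc_handle, uint32_t dom, unsigned long pfn)
{
    void *page = xc_map_foreign_range(xc_handle, dom, MAGIC_PAGE_SIZE,
                                      PROT_READ | PROT_WRITE, pfn);
    if (page == NULL) {
        fprintf(stderr, "error zeroing magic pages: map of pfn 0x%lx failed\n",
                pfn);
        return -1;
    }
    memset(page, 0, MAGIC_PAGE_SIZE);
    munmap(page, MAGIC_PAGE_SIZE);
    return 0;
}

A NULL return from xc_map_foreign_range() at this point is presumably
what surfaces as the "error zeroing magic pages" line in xend.log.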
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Petersson, Mats
> Sent: 25 April 2007 15:25
> To: Tim Deegan
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: RE: [Xen-devel] HVM Save/Restore status.
>
[snip]
>
> > Strange. Are you doing anything weird with the ioreq or xenstore pages
> > in the simple guest? Their PFNs should have been maintained across the
> > first save/restore cycle, and they were mappable the first time...
>
> I'm trying to see what fails and where by printing something at every
> failure point. So far I've tracked it down to somewhere inside the
> function direct_remap_pfn_range... Not sure where in this function it
> goes wrong or where in any of the called functions. As far as I can see,
> there's not many things that can go wrong there...

Error is 14, which is "EFAULT", which means that the problem appears to
be inside the hypercall. I'll see if I can print the different pages
involved here.

Also, I missed answering the question of what I do with those pages:
Nothing. My guest uses about 2MB of the entire 32MB memory range, around
1MB-3MB.

--
Mats
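A small aside on the errno-14 diagnosis: errno 14 is EFAULT on Linux, so
a failure report can decode the number directly. A sketch, assuming the
failed mapping call leaves the underlying ioctl's errno intact:

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Decode errno after a failed foreign mapping so the log says
 * "Bad address" (EFAULT, 14) rather than a bare number. */
static void report_map_failure(unsigned long pfn)
{
    fprintf(stderr, "mapping pfn 0x%lx failed: errno %d (%s)\n",
            pfn, errno, strerror(errno));
}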
> -----Original Message-----
> From: Petersson, Mats
> Sent: 25 April 2007 16:18
> To: Petersson, Mats; Tim Deegan
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: RE: [Xen-devel] HVM Save/Restore status.
>
[snip]
>
> > I'm trying to see what fails and where by printing something at every
> > failure point. So far I've tracked it down to somewhere inside the
> > function direct_remap_pfn_range... Not sure where in this function it
> > goes wrong or where in any of the called functions. As far as I can
> > see, there's not many things that can go wrong there...
>
> Error is 14, which is "EFAULT", which means that the problem appears to
> be inside the hypercall.
>
> I'll see if I can print the different pages involved here.

So, a few printfs later: the first time (which succeeds) and the second
time (which fails) use exactly the same frame numbers (1fff, 1ffe, 1ffd).
It fails on the FIRST one: I split the "if( ... [0] || ... [1] || ... [2] )"
into separate lines and print the failure on each with a "[n]" to indicate
which one failed, and it got [0] in the printout.

--
Mats
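The debugging change described above looks roughly like this; the array
and helper names are placeholders, not the real restore-code identifiers:

#include <stdio.h>

/* Placeholders for illustration only. */
extern unsigned long magic_pfns[3];             /* here: 0x1fff, 0x1ffe, 0x1ffd */
extern void *map_magic_page(unsigned long pfn);

/* Instead of one OR'd test over all three mappings, test each page
 * separately so the log shows the index of the one that failed. */
static int check_magic_pages(void)
{
    int n;
    for (n = 0; n < 3; n++) {
        if (map_magic_page(magic_pfns[n]) == NULL) {
            fprintf(stderr, "error zeroing magic pages [%d]: pfn 0x%lx\n",
                    n, magic_pfns[n]);
            return -1;
        }
    }
    return 0;
}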
On 25/4/07 17:15, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:
> So, a few printfs later: the first time (which succeeds) and the second
> time (which fails) use exactly the same frame numbers (1fff, 1ffe, 1ffd).
> It fails on the FIRST one: I split the "if( ... [0] || ... [1] || ... [2] )"
> into separate lines and print the failure on each with a "[n]" to
> indicate which one failed, and it got [0] in the printout.

Mats,

Can you try adding 1 to p2m_size after the line:
    p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &dom);
in xc_domain_save.c, please. I think we have an out-by-one error that you
are triggering because your mini-domain does not drive the cirrus_vga lfb
and hence does not have any 'video ram' mapped in the RAM hole. You might
also want to print p2m_size in xc_domain_save and confirm this hypothesis
that way too.

This would also bite us for other guests with more than 4GB (we'd lose a
page per save/restore, I think). So this is a nice one to fix before 3.0.5
if I'm right!

 -- Keir
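As a sketch against the line quoted above (variable declarations and the
surrounding save logic omitted; the logging call is illustrative, not the
file's own macro), the suggested experiment is:

/* tools/libxc/xc_domain_save.c (abbreviated): XENMEM_maximum_gpfn
 * returns the highest guest pfn, so the number of p2m entries to
 * cover is that value plus one. */
p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &dom) + 1;

/* Confirm the hypothesis by logging the value actually used. */
fprintf(stderr, "p2m_size = 0x%lx\n", (unsigned long)p2m_size);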
> -----Original Message-----
> From: Keir Fraser [mailto:keir@xensource.com]
> Sent: 25 April 2007 17:51
> To: Petersson, Mats; Tim Deegan
> Cc: xen-devel@lists.xensource.com; Woller, Thomas
> Subject: Re: [Xen-devel] HVM Save/Restore status.
>
[snip]
>
> Can you try adding 1 to p2m_size after the line:
>     p2m_size = xc_memory_op(xc_handle, XENMEM_maximum_gpfn, &dom);
> in xc_domain_save.c, please. I think we have an out-by-one error that
> you are triggering because your mini-domain does not drive the
> cirrus_vga lfb and hence does not have any 'video ram' mapped in the
> RAM hole. You might also want to print p2m_size in xc_domain_save and
> confirm this hypothesis that way too.

Ok, here goes:

First save:  p2m_size = 0xfffff (succeeds)
Second save: p2m_size = 0x1fff  (fails)

I'm a little bit surprised about the first number, as it's about 4GB(?)
(my domain is officially only 32MB, and uses a whole lot less actual
memory), but I guess the second number should be 0x2000 if it's the
actual size rather than the highest pfn number available in the guest.
Does that make sense to you?

[Aside from my printout, there's also a printout already in the xend.log
from xc_domain_restore ("start: p2m_size = xxxxx"), which displays the
same data as I've reported above, both before my change and after it, so
I do believe my printout isn't completely bogus.]

--
Mats
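Those numbers fit the out-by-one reading: a 32MB guest has 0x2000 4KB
frames, so its highest gpfn is 0x1fff, and using that value directly as a
count drops the last page. The arithmetic, as a standalone check (nothing
Xen-specific):

#include <stdio.h>

int main(void)
{
    unsigned long guest_bytes = 32UL << 20;          /* 32MB mini-guest */
    unsigned long frames      = guest_bytes / 4096;  /* 0x2000 frames   */
    unsigned long max_gpfn    = frames - 1;          /* 0x1fff          */

    /* XENMEM_maximum_gpfn reports the highest pfn (0x1fff here), so a
     * count of p2m entries needs max_gpfn + 1 (0x2000). */
    printf("frames = 0x%lx, max gpfn = 0x%lx, p2m_size = 0x%lx\n",
           frames, max_gpfn, max_gpfn + 1);
    return 0;
}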
On 25/4/07 18:07, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:
> First save:  p2m_size = 0xfffff (succeeds)
> Second save: p2m_size = 0x1fff  (fails)
>
> I'm a little bit surprised about the first number, as it's about 4GB(?)
> (my domain is officially only 32MB, and uses a whole lot less actual
> memory), but I guess the second number should be 0x2000 if it's the
> actual size rather than the highest pfn number available in the guest.
> Does that make sense to you?

This seems to confirm my suspicion. Can you try doing +1 on the p2m_size
now? It should fix your problem. If so I'll do a full fix tomorrow.

Thanks,
Keir
> -----Original Message-----
> From: Woller, Thomas
> Sent: 26 April 2007 20:43
> To: Petersson, Mats
> Subject: FW: [Xen-devel] HVM Save/Restore status.
>
> How did this turn out? Nice work btw!

[snip]

> This seems to confirm my suspicion. Can you try doing +1 on the
> p2m_size now? It should fix your problem. If so I'll do a full fix
> tomorrow.

I was out on holiday on Thursday - sorry, I forgot to say that in my last
e-mail on Wednesday.

I tried that quickly this morning, then the official fix, both of which
work just fine. The simple-guest still hangs after some hundreds of
save/restore cycles (which doesn't happen when it's just running
normally). Not sure what causes this. I would also like to figure out why
the Linux guest loses keyboard/network communications after some time.

But this is certainly an improvement - thanks Keir for the idea + fix.

--
Mats