Graham, Simon
2006-Jul-17 19:17 UTC
[Xen-devel] Live migration fails when available memory exactly equal to required memory on target system
In diagnosing live migration failures (with 3.0.testing), I have noticed that a common failure is a lack of resources on the target system _and_ that this only seems to happen when the available resources at the time of the migration are exactly what is required for the VM being migrated. For example, here's a xend.log extract from a failed case:

[2006-07-17 14:38:56 xend] DEBUG (balloon:128) Balloon: free 265; need 265; done.
[2006-07-17 14:38:56 xend] DEBUG (XendCheckpoint:148) [xc_restore]: /usr/lib/xen/bin/xc_restore 10 4 112 67584 1 2
[2006-07-17 14:38:57 xend] ERROR (XendCheckpoint:242) xc_linux_restore start: max_pfn = 10800
[2006-07-17 14:38:57 xend] ERROR (XendCheckpoint:242) Failed allocation for dom 112: 67584 pages order 0 addr_bits 0
[2006-07-17 14:38:57 xend] ERROR (XendCheckpoint:242) Failed to increase reservation by 42000 KB: 12
[2006-07-17 14:38:57 xend] ERROR (XendCheckpoint:242) Restore exit with rc=1

The nr_pfns parameter to xc_restore shows that we need 264MB - balloon.py added a slop of 1MB to that to come up with the 265 number. Immediately following this failed attempt, I tried again:

[2006-07-17 14:38:58 xend] DEBUG (balloon:134) Balloon: free 264; need 265; retries: 10.
[2006-07-17 14:38:58 xend] DEBUG (balloon:143) Balloon: setting dom0 target to 1235.
[2006-07-17 14:38:58 xend.XendDomainInfo] DEBUG (XendDomainInfo:945) Setting memory target of domain Domain-0 (0) to 1235 MiB.
[2006-07-17 14:38:58 xend] DEBUG (balloon:128) Balloon: free 265; need 265; done.
[2006-07-17 14:38:58 xend] DEBUG (XendCheckpoint:148) [xc_restore]: /usr/lib/xen/bin/xc_restore 10 4 113 67584 1 2
[2006-07-17 14:38:59 xend] ERROR (XendCheckpoint:242) xc_linux_restore start: max_pfn = 10800
[2006-07-17 14:38:59 xend] ERROR (XendCheckpoint:242) Increased domain reservation by 42000 KB

This time, we can see that there was only 264MB free so we had to kick the balloon driver to free up 1MB - once this was done (and we had exactly 265MB free again), we were able to increase the reservation for the target DomU to the requested amount...

The above is fairly reproducible but I'm not sure where to go next to figure out where the issue really is (or, indeed, if there really is an issue -- maybe this is just one of those inherently racy things; however, I find it odd that it only seems to happen when the initial free is exactly the same as the desired; I have plenty of other cases where there is way more and way less memory available, all of which seem to work just fine).

Any suggestions?

/simgr

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
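As a quick check on the figures in the message above, here is a minimal Python sketch of the arithmetic. It is not taken from balloon.py: the 4 KiB page size, the constant names, and the helper pages_to_mib are assumptions made for illustration. It also shows that the "10800" and "42000 KB" values in the log are consistent with hexadecimal renderings of 67584 pages and 270336 KB, although the log itself does not say which base it uses.

```python
# Illustrative arithmetic only; names and the 4 KiB page size are assumptions.
PAGE_SIZE_KIB = 4   # assumed x86 page size in KiB
SLACK_MIB = 1       # the 1 MB "slop" described in the message


def pages_to_mib(nr_pfns: int) -> int:
    """Convert a page count to MiB, rounding up to a whole MiB."""
    kib = nr_pfns * PAGE_SIZE_KIB
    return (kib + 1023) // 1024


nr_pfns = 67584                      # nr_pfns argument passed to xc_restore
required_mib = pages_to_mib(nr_pfns)
need_mib = required_mib + SLACK_MIB

print(required_mib, need_mib)        # 264 265
print(hex(nr_pfns))                  # 0x10800
print(hex(nr_pfns * PAGE_SIZE_KIB))  # 0x42000
```

Under these assumptions the domain needs exactly 264 MiB, so a "need" of 265 leaves only 1 MiB of headroom when free memory is also reported as 265.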
Keir Fraser
2006-Jul-17 19:33 UTC
Re: [Xen-devel] Live migration fails when available memory exactly equal to required memory on target system
On 17 Jul 2006, at 20:17, Graham, Simon wrote:

> This time, we can see that there was only 264MB free so we had to kick
> the balloon driver to free up 1MB - once this was done (and we had
> exactly 265MB free again), we were able to increase the reservation for
> the target DomU to the requested amount...
>
> The above is fairly reproducible but I'm not sure where to go next to
> figure out where the issue really is (or, indeed, if there really is an
> issue -- maybe this is just one of those inherently racy things;
> however, I find it odd that it only seems to happen when the initial
> free is exactly the same as the desired; I have plenty of other cases
> where there is way more and way less memory available all of which seem
> to work just fine).

Maybe there's a rounding error in the dom0 ballooning code. Your best bet is to get tracing out of Xen, which can tell you exactly how many pages are on Xen's free lists and how many are requested for the new domain being created -- a debug build of Xen would be a good start as that will turn on the DPRINTK tracing in Xen.

Also bear in mind that the amount of free memory is always fluctuating in a live system -- for example the Xen network drivers are continually freeing and allocating memory to/from Xen. So if the dom0 ballooning logic is only freeing *exactly* what is required then that is probably a bit stupid: maybe some slack needs to be added (or the amount of slack increased).

 -- Keir
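To make the slack suggestion concrete, here is a small Python sketch of how a larger margin would change the decision to balloon dom0. This is a hypothetical illustration, not the actual balloon.py logic: the function memory_to_release, its parameters, and the example slack of 8 MiB are all assumptions.

```python
# Hypothetical sketch; not the real balloon.py code.
def memory_to_release(free_mib: int, required_mib: int, slack_mib: int = 1) -> int:
    """Return how many MiB dom0 would have to give back so that
    free memory reaches required_mib + slack_mib (0 if already enough)."""
    need_mib = required_mib + slack_mib
    return max(0, need_mib - free_mib)


# Failed attempt in the log: free 265, need 264 + 1 -> nothing is released.
print(memory_to_release(free_mib=265, required_mib=264))               # 0
# Retry in the log: free 264, need 265 -> 1 MiB is released from dom0.
print(memory_to_release(free_mib=264, required_mib=264))               # 1
# With a larger, purely illustrative slack of 8 MiB, the first case
# would release 7 MiB instead of proceeding with no extra headroom.
print(memory_to_release(free_mib=265, required_mib=264, slack_mib=8))  # 7
```

Whether a larger slack would actually avoid the failure is not established in this thread; the sketch only shows how the threshold moves.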
Graham, Simon
2006-Jul-18 01:28 UTC
RE: [Xen-devel] Live migration fails when available memory exactly equal to required memory on target system
I initially was also of the opinion that it was simply other code grabbing the memory between the time the balloon driver freed it and the time xc_restore attempted to grab it, _but_ that doesn't explain why it only seems to happen when the original free amount is exactly the required amount (I can't prove this is always the case, but my observations so far are that it is).

I looked at the rounding code in balloon.py -- it's right, although I do think having a slack of only 1 MB is perhaps a little low.

I will try the debug build of Xen (just as soon as this Windows-DDK guy can figure out how to build it ;-)

Thanks,
/simgr