Jean-Yves Migeon
2008-Dec-04 17:26 UTC
[Xen-devel] Invalid types between save and restore, Xen 3.1.4
Hi list,

I am currently charged with the implementation of save/restore/migrate inside NetBSD. So far, my current work does manage to save/restore a NetBSD domU, but I am intermittently (about one time in ten) facing issues regarding page type validation and pinning when cycling saves/restores.

For unknown reasons, the save operation works, but restore might fail, with xend reporting:

[2008-12-04 17:24:40 219] INFO (XendCheckpoint:370) Received all pages (0 races)
[2008-12-04 17:24:40 219] INFO (XendCheckpoint:370) ERROR Internal error: Failed to pin batch of 21 page tables
[2008-12-04 17:24:40 219] INFO (XendCheckpoint:370) Restore exit with rc=1

This is due to the hypervisor refusing some type validation when xc_restore is issuing its xc_mmuext_op():

(XEN) mm.c:1842:d0 Bad type (saw 28000008 != exp e0000000) for mfn 1f16f (pfn 43e)
(XEN) mm.c:649:d0 Error getting mfn 1f16f (pfn 43e) from L1 entry 1f16f023 for dom13
(XEN) mm.c:916:d0 Failure in alloc_l1_table: entry 768
(XEN) mm.c:1863:d0 Error while validating mfn 1ee38 (pfn 775) for type 20000000: caf=80000003 taf=20000001
(XEN) mm.c:683: get_l2_linear_pagetable() ret: 0 (exp 1)
(XEN) mm.c:1091:d0 Failure in alloc_l2_table: entry 1007
(XEN) mm.c:1863:d0 Error while validating mfn 1efb4 (pfn 5f9) for type 40000000: caf=80000003 taf=40000001
(XEN) mm.c:2132:d0 Error while pinning mfn 1efb4

It is erratic and hard to reproduce. I suppose that I am facing a race inside the VM code, but as I am not familiar with Xen's inner workings with the MMU, I am having a hard time tracking it down.

The L1 and L2 entries at fault are always the same. The L2 entry 1007 corresponds to an "alternative" recursive PD in our VM subsystem, and the L1 entry 768 is the start of our kernel's virtual memory.

This is with Xen 3.1.4. NetBSD does not use writable mappings, and manipulates the MMU only through the hypercall API. MFN manipulations are suspended during a save, to avoid any incorrect ones after a restore.
What I would like to know is the kind of operations that could result in such a situation. Considering that the xentools should have an accurate view of the pfn_types through the p2m table, how is it possible that, between save and restore, the hypervisor refuses to validate pages, given that mappings should not change after the call to HYPERVISOR_suspend()?

For example, why is Xen expecting a writable mapping while the page is validated as L1?

I was wondering if anyone could shed some light on this for me. Please correct me if I am wrong.

Thanking you in advance for your help,

--
Jean-Yves Migeon
jeanyves.migeon@free.fr

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
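The type words in the hypervisor log (28000008, e0000000, 20000000, 40000000) can be decoded by hand. The sketch below assumes the 32-bit x86 page type_info layout as recalled from the PGT_* definitions in Xen 3.x's asm-x86/mm.h: type in the top three bits, pinned at bit 28, validated at bit 27, use count in the low 16 bits. That layout and the helper name decode_type_info are assumptions for illustration; check them against the headers of the tree actually in use.

```python
# Assumed layout of a 32-bit x86 Xen 3.x page type_info word. This is an
# assumption based on the PGT_* constants; verify against your mm.h.
PGT_TYPE_SHIFT = 29
PGT_TYPES = {
    0: "none",
    1: "l1_page_table",
    2: "l2_page_table",
    3: "l3_page_table",
    4: "l4_page_table",
    5: "gdt_page",
    6: "ldt_page",
    7: "writable_page",
}
PGT_PINNED = 1 << 28          # guest has pinned the page to its type
PGT_VALIDATED = 1 << 27       # page has been validated as its type
PGT_COUNT_MASK = (1 << 16) - 1  # number of uses as the current type


def decode_type_info(word):
    """Split a type_info word into its assumed fields (hypothetical helper)."""
    return {
        "type": PGT_TYPES[(word >> PGT_TYPE_SHIFT) & 0x7],
        "pinned": bool(word & PGT_PINNED),
        "validated": bool(word & PGT_VALIDATED),
        "count": word & PGT_COUNT_MASK,
    }


if __name__ == "__main__":
    # Values taken from the "Bad type" and "Error while validating" log lines.
    for label, word in (("saw", 0x28000008), ("exp", 0xE0000000),
                        ("type", 0x20000000), ("type", 0x40000000)):
        print(label, hex(word), decode_type_info(word))
```

Under these assumptions, "saw 28000008" reads as a validated L1 page table with eight type references, and "exp e0000000" as a request for a writable page, which agrees with the reading of the log given in the message above.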
Keir Fraser
2008-Dec-04 18:12 UTC
Re: [Xen-devel] Invalid types between save and restore, Xen 3.1.4
On 04/12/2008 17:26, "Jean-Yves Migeon" <jeanyves.migeon@free.fr> wrote:

> What I would like to know is the kind of operations that could result on
> such a situation. Considering that the xentools should have an accurate
> view of the pfn_types through the p2m table, how could it become
> possible that between save and restore, hypervisor refuses to validate
> pages, as mappings should not change after the call to HYPERVISOR_suspend()?
>
> For example, why is Xen expecting a writable mapping while the page is
> validated as L1?

My guess is that Xen's existing save/restore code is not compatible with your 'alternative' recursive PDs. For such a recursive PD to be detected, the PD being mapped must *already* be validated as a PD. Otherwise (let's assume 2-level pagetables for a moment, with levels called PD and PT) if the mapped PD is not yet validated, it will by default get validated as a PT!

This explains what you see: the pages mapped by the PD are not interpreted as PTs but as data pages (because Xen has erroneously decided that the PD is a PT). Then it will try to validate them as writable data pages and get confused because some of them are already validated as pagetable pages!

How to fix this... Well:

(a) Hack xc_domain_save.c and xc_domain_restore.c a bunch. Not fun.

(b) In the NetBSD kernel, zap alternative recursive PDs before suspending yourself. If this is possible it will save you a headache. Perhaps you can flush them somehow, or otherwise zap _PAGE_PRESENT and then reinstate it yourself during resume?

If you have to go down route (a)... I'd have to think a bit about how best to fix the issue.

Oh, I'll add that this whole issue will definitely not exist for *self* recursive PDs. Those will work no problem.

-- Keir
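The ordering problem described in this reply can be reproduced with a small toy model. The sketch below is not Xen code: ToyMMU, the pfn numbers, and ALT_SLOT are hypothetical, there are no reference counts or rollback, and only the rule "a PD entry pointing at a page that is not already a validated PD gets validated as a PT" is modelled, with a two-level layout as in the explanation above. Recursive entries are modelled as read-only and ordinary PD entries as read-write, which is an assumption about the guest.

```python
NONE, L1, L2, WRITABLE = "none", "l1", "l2", "writable"


class BadType(Exception):
    """A page was requested as a type that conflicts with its current one."""


class ToyMMU:
    def __init__(self, tables):
        # tables: pfn -> list of entries; an entry is (target_pfn, rw) or
        # None for a not-present slot. Pfns absent from the dict are data pages.
        self.tables = tables
        self.types = {}  # pfn -> type the page is currently validated as

    def get_type(self, pfn, want):
        have = self.types.get(pfn, NONE)
        if have == want:
            return
        if have != NONE:
            raise BadType(f"pfn {pfn}: saw {have} != exp {want}")
        self.types[pfn] = want  # recorded first so self-references see it
        for entry in self.tables.get(pfn, []):
            if entry is None:
                continue
            target, rw = entry
            if want == L2:
                # An entry pointing at an already-validated PD is accepted as
                # a linear (recursive) mapping; anything else must be a PT.
                if self.types.get(target) != L2:
                    self.get_type(target, L1)
            elif want == L1 and rw:
                # A writable PT entry requires a writable data page.
                self.get_type(target, WRITABLE)

    def pin_all(self, order):
        for pfn in order:
            self.get_type(pfn, L2)


PD_A, PD_B, PT_KERN, DATA = 10, 11, 20, 30  # hypothetical pfns
ALT_SLOT = 2  # index of the alternative recursive entry in PD_A


def make_tables():
    return {
        # self-recursive entry, kernel PT, alternative recursive entry -> PD_B
        PD_A: [(PD_A, False), (PT_KERN, True), (PD_B, False)],
        PD_B: [(PD_B, False), (PT_KERN, True)],
        PT_KERN: [(DATA, True)],
    }


def try_pin(tables, order):
    """Pin the PDs in the given order; return "ok" or the failure message."""
    try:
        ToyMMU(tables).pin_all(order)
        return "ok"
    except BadType as exc:
        return str(exc)


if __name__ == "__main__":
    print(try_pin(make_tables(), [PD_B, PD_A]))  # mapped PD validated first
    print(try_pin(make_tables(), [PD_A, PD_B]))  # mapped PD not yet validated
    zapped = make_tables()
    zapped[PD_A][ALT_SLOT] = None                # option (b): zap before suspend
    print(try_pin(zapped, [PD_A, PD_B]))
```

In this model, pinning PD_A before PD_B makes PD_B get validated as a PT, so the kernel PT it references is then requested as a writable page while it is already typed as L1, which mirrors the "Bad type" line in the log. Clearing the alternative slot before the save removes the dependency on pin order; the guest would reinstate the entry itself after resume, once both PDs are validated.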