While doing "while xm migrate --live domU localhost;do sleep 2;done" I see many errors from get_page:

...
(XEN) HVM56 restore: TSC_ADJUST 0
(XEN) HVM56 restore: TSC_ADJUST 1
(XEN) mm.c:1982:d0 Error pfn 41a863: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41be1c: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41a862: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41b90f: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41b49a: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41b48d: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) irq.c:375: Dom56 callback via changed to Direct Vector 0xf3
(XEN) HVM56 save: CPU
...

The pfn number and the amount of pfn differs during iterations, but in the end only these two variants appear:

# xm dmesg | grep -w mm | cut -d : -f 4- | sort | uniq -c | sort
     22 rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
     46 rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001

It does not seem to cause issues other than the log output.
Does it indicate a real bug?

Olaf
Jan Beulich
2013-Feb-21 16:42 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
>>> On 21.02.13 at 15:48, Olaf Hering <olaf@aepfle.de> wrote:
> While doing "while xm migrate --live domU localhost;do sleep 2;done" I
> see many errors from get_page:
>
> ...
> (XEN) HVM56 restore: TSC_ADJUST 0
> (XEN) HVM56 restore: TSC_ADJUST 1
> (XEN) mm.c:1982:d0 Error pfn 41a863: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41be1c: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41a862: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41b90f: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41b49a: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41b48d: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) irq.c:375: Dom56 callback via changed to Direct Vector 0xf3
> (XEN) HVM56 save: CPU
> ...
>
> The pfn number and the amount of pfn differs during iterations, but in the
> end only these two variants appear:
>
> # xm dmesg | grep -w mm | cut -d : -f 4- | sort | uniq -c | sort
>      22 rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
>      46 rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001
>
> It does not seem to cause issues other than the log output.
> Does it indicate a real bug?

I'm afraid it does - a non-zero type count should generally not be accompanied by a zero general count. That's specifically because lone put_page_type() calls are pretty rare, and going through all of them I don't see any that could be the one outstanding in your case. I'm surprised this doesn't cause an assertion to trigger somewhere. You are using a debug hypervisor, aren't you?

Of course, if this truly is just a "leaked" type reference, then no other bad consequences are to be afraid of. What you could do to get a better understanding of when this happens is to add a WARN_ON() alongside the printk() (perhaps such that it triggers only once for each of the two different cases), and then let us look at the call trace. Jan
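
The "only once for each of the two different cases" idea can be sketched in plain C. This is a userspace stand-in, not the patch that was actually applied to mm.c: the function name `warn_once_per_case` and the two `TAF_*` constants are made up for illustration, with the constants simply copied from the two taf values in the log. In the hypervisor the `fprintf()` would be a `WARN_ON()` next to the existing `printk()`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* The two type_info (taf) values that show up in the log. */
#define TAF_WRITABLE_VALIDATED UINT64_C(0x7400000000000001)
#define TAF_COUNT_ONLY         UINT64_C(0x0000000000000001)

/*
 * Hypothetical helper: returns true (and reports) only the first time
 * each of the two taf variants is seen, so the log gets one trace per
 * case instead of one per failing pfn.
 */
bool warn_once_per_case(uint64_t taf)
{
    static bool seen_writable, seen_count_only;
    bool *seen;

    if (taf == TAF_WRITABLE_VALIDATED)
        seen = &seen_writable;
    else if (taf == TAF_COUNT_ONLY)
        seen = &seen_count_only;
    else
        return false;           /* not one of the two known variants */

    if (*seen)
        return false;           /* already reported this variant */

    *seen = true;
    fprintf(stderr, "WARN: first get_page failure with taf=%016llx\n",
            (unsigned long long)taf);
    return true;
}
```

The static flags are not protected against concurrent callers; for a one-off debugging aid a duplicated trace would be harmless.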
Olaf Hering
2013-Feb-21 17:31 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
On Thu, Feb 21, Jan Beulich wrote:> What you could do to get a better understanding of when this > happens is to add a WARN_ON() alongside the printk() (perhaps > such that it triggers only once for each of the two different > cases), and then let us look at the call trace.It did not happen with xl. Here is the output while doing xm migrate: (XEN) HVM2 restore: VMCE_VCPU 0 (XEN) HVM2 restore: VMCE_VCPU 1 (XEN) HVM2 restore: TSC_ADJUST 0 (XEN) HVM2 restore: TSC_ADJUST 1 (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001 (XEN) Xen WARN at mm.c:1986 (XEN) ----[ Xen-4.3.26579-20130221.171413 x86_64 debug=y Not tainted ]---- (XEN) CPU: 20 (XEN) RIP: e008:[<ffff82c4c0170fb2>] get_page+0xfb/0x151 (XEN) RFLAGS: 0000000000010246 CONTEXT: hypervisor (XEN) rax: 7400000000000001 rbx: 7400000000000001 rcx: 0000000000000000 (XEN) rdx: 7400000000000001 rsi: 000000000000000a rdi: ffff82c4c0280748 (XEN) rbp: ffff83036d5f7958 rsp: ffff83036d5f7908 r8: 0000000000000014 (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001 (XEN) r12: 0180000000000000 r13: 0000000000000000 r14: ffff83036ffef000 (XEN) r15: ffff82e0082258a0 cr0: 000000008005003b cr4: 00000000000026f0 (XEN) cr3: 000000065e78f000 cr2: ffff8805ad260040 (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008 (XEN) Xen stack trace from rsp=ffff83036d5f7908: (XEN) 0000000000000000 0180000000000000 7400000000000001 ffff82c4c01e1726 (XEN) 00000000004112c5 ffff82e0082258a0 ffff83036d5f79fc ffff83036d5f799c (XEN) ffff830232aa49e0 ffff830232aa49e0 ffff83036d5f79c8 ffff82c4c01e1d87 (XEN) ffff830300000000 0000000000000086 000000026d5f7a04 00000000000190c5 (XEN) ffff830402126000 ffffffff01000086 000000076d5f7bc0 0000000000000000 (XEN) ffff830402126000 ffff83040225dc60 00000000000190c5 ffff83036d5f7ba0 (XEN) ffff83036d5f7a28 ffff82c4c01098ae ffff83036d5f7a98 ffff83036d5f7ba0 (XEN) ffff83036d5f7ab8 00000000000f03f8 000000006d5f7a08 
0000000000000000 (XEN) 0000000000000240 ffff83040225dc60 0000000000000000 ffff83036d5f7ba0 (XEN) ffff83036d5f7ae8 ffff82c4c0109e55 ffff830402296200 0000000000000086 (XEN) 0000018300000009 00000000000000fd ffff83036d5f7bb0 000082e000000000 (XEN) ffff830402126000 ffff830402203c58 00000000002337b6 ffff830402296200 (XEN) 0000000000000000 ffff830402296200 ffff830402296200 000002406d5f7ba8 (XEN) ffff830300000000 ffff83036d5f7bb0 000000006d5f7b68 0000000000000000 (XEN) ffff83036d5ce000 0000000000000000 0000000000000000 ffff830402126000 (XEN) ffff83036d5f7c28 ffff82c4c010bef0 ffff83036d5f7bc4 ffff83036d5f7bc0 (XEN) ffff830300000001 0000000000000096 ffff83036d5f7bec ffff82c4c0319820 (XEN) ffff83036d5f0000 ffff83036d5f0000 ffff83036d5f0000 ffff83036d5f0000 (XEN) ffff83036d5f0000 ffff83036d5f7bc8 ffff83036d5f0000 0000000000000001 (XEN) ffffc90010283a40 0000000000000002 ffff83036d5f7bd8 ffff82c4c0125aa4 (XEN) Xen call trace: (XEN) [<ffff82c4c0170fb2>] get_page+0xfb/0x151 (XEN) [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284 (XEN) [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170 (XEN) [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae (XEN) [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843 (XEN) [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82 (XEN) [<ffff82c4c011502f>] do_multicall+0x227/0x444 (XEN) [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145 (XEN) (XEN) mm.c:1983:d0 Error pfn 41144d: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001 (XEN) mm.c:1983:d0 Error pfn 4116b0: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001 (XEN) irq.c:375: Dom2 callback via changed to Direct Vector 0xf3 (XEN) HVM2 save: CPU ... 
(XEN) HVM3 restore: VMCE_VCPU 0 (XEN) HVM3 restore: VMCE_VCPU 1 (XEN) HVM3 restore: TSC_ADJUST 0 (XEN) HVM3 restore: TSC_ADJUST 1 (XEN) mm.c:1983:d0 Error pfn 43f7d4: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001 (XEN) Xen WARN at mm.c:1990 (XEN) ----[ Xen-4.3.26579-20130221.171413 x86_64 debug=y Not tainted ]---- (XEN) CPU: 14 (XEN) RIP: e008:[<ffff82c4c0170fdc>] get_page+0x125/0x151 (XEN) RFLAGS: 0000000000010246 CONTEXT: hypervisor (XEN) rax: 7400000000000001 rbx: 0000000000000001 rcx: 0000000000000000 (XEN) rdx: 0000000000000001 rsi: 000000000000000a rdi: ffff82c4c0280748 (XEN) rbp: ffff83036ff2f958 rsp: ffff83036ff2f908 r8: 000000000000000e (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001 (XEN) r12: 0180000000000000 r13: 0000000000000000 r14: ffff83036ffef000 (XEN) r15: ffff82e0087efa80 cr0: 000000008005003b cr4: 00000000000026f0 (XEN) cr3: 0000000230871000 cr2: ffff8805abd77f20 (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008 (XEN) Xen stack trace from rsp=ffff83036ff2f908: (XEN) 0000000000000000 0180000000000000 0000000000000001 ffff82c4c01e1726 (XEN) 000000000043f7d4 ffff82e0087efa80 ffff83036ff2f9fc ffff83036ff2f99c (XEN) ffff830231fa8010 ffff830231fa8010 ffff83036ff2f9c8 ffff82c4c01e1d87 (XEN) ffff830300000000 0000000000000086 00000002900d0604 0000000000019110 (XEN) ffff83022fb64000 ffffffff01000048 000000076ff2fa18 0000000000000000 (XEN) ffff83022fb64000 ffff830232319db0 0000000000019110 ffff83036ff2fba0 (XEN) ffff83036ff2fa28 ffff82c4c01098ae ffff83036ff2fa58 ffff83036ff2fba0 (XEN) ffff83036ff2fab8 0000000000000004 0000000000000000 0000000000000000 (XEN) 0000000000000247 ffff830232319db0 0000000000000000 ffff83036ff2fba0 (XEN) ffff83036ff2fae8 ffff82c4c0109e55 ffff83022f019238 0000000000000000 (XEN) ffff83036ff2fad8 0000000000000004 ffff83036ff2fbb0 000082e000000000 (XEN) ffff83022fb64000 ffff8302325a0c58 00000000002338d7 ffff83022f019238 (XEN) 0000000000000000 ffff83022f019238 
ffff83022f019238 000002476ff2fba8 (XEN) ffff830300000000 ffff8302325a0c58 000000006d5ce000 0000000000000000 (XEN) ffff83036d5ce000 0000000000000000 0000000000000000 ffff83022fb64000 (XEN) ffff83036ff2fc28 ffff82c4c010bef0 ffff83036ff2fbc4 ffff83036ff2fbc0 (XEN) ffff830300000001 0000000000000096 ffff83036ff2fbec ffff82c4c0319820 (XEN) ffff83036ff28000 ffff83036ff28000 ffff83036ff28000 ffff83036ff28000 (XEN) ffff83036ff28000 ffff83036ff2fbc8 ffff83036ff28000 0000000000000001 (XEN) ffffc90010283a40 0000000000000002 ffff83036ff2fbd8 ffff82c4c0125aa4 (XEN) Xen call trace: (XEN) [<ffff82c4c0170fdc>] get_page+0x125/0x151 (XEN) [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284 (XEN) [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170 (XEN) [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae (XEN) [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843 (XEN) [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82 (XEN) [<ffff82c4c011502f>] do_multicall+0x227/0x444 (XEN) [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145 (XEN) (XEN) mm.c:1983:d0 Error pfn 43e646: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001 (XEN) mm.c:1983:d0 Error pfn 43f86a: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001 (XEN) mm.c:1983:d0 Error pfn 43e683: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001 (XEN) mm.c:1983:d0 Error pfn 43f31b: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001 (XEN) mm.c:1983:d0 Error pfn 43e5f0: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001 (XEN) mm.c:1983:d0 Error pfn 43f3b7: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001 (XEN) mm.c:1983:d0 Error pfn 43f87c: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001 (XEN) irq.c:375: Dom3 callback via changed to Direct Vector 0xf3 ... Olaf
Jan Beulich
2013-Feb-22 07:42 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> On Thu, Feb 21, Jan Beulich wrote:
>
>> What you could do to get a better understanding of when this
>> happens is to add a WARN_ON() alongside the printk() (perhaps
>> such that it triggers only once for each of the two different
>> cases), and then let us look at the call trace.
>
> It did not happen with xl.

Odd.

> Here is the output while doing xm migrate:
> ...
> (XEN) Xen call trace:
> (XEN) [<ffff82c4c0170fb2>] get_page+0xfb/0x151
> (XEN) [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
> (XEN) [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
> (XEN) [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
> (XEN) [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
> (XEN) [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
> (XEN) [<ffff82c4c011502f>] do_multicall+0x227/0x444
> (XEN) [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145
> ...
> (XEN) Xen call trace:
> (XEN) [<ffff82c4c0170fdc>] get_page+0x125/0x151
> (XEN) [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
> (XEN) [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
> (XEN) [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
> (XEN) [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
> (XEN) [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
> (XEN) [<ffff82c4c011502f>] do_multicall+0x227/0x444
> (XEN) [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145

And that's without paging or anything similar? Else I would expect the problem to be in the interaction there. But I'll also take a look at the grant table code assuming this is a problem even without any enhanced memory management functionality...

Jan
Olaf Hering
2013-Feb-22 08:57 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
On Fri, Feb 22, Jan Beulich wrote:

> And that's without paging or anything similar? Else I would expect
> the problem to be in the interaction there. But I'll also take a look
> at the grant table code assuming this is a problem even without
> any enhanced memory management functionality...

Nothing like this is enabled.

name="domU"
description="something"
uuid="a062cabb-5981-4472-9d3b-da7bd8e2594e"
memory=512
vcpus=2
serial="pty"
builder="hvm"
boot="dcn"
disk=[ 'file:/some/vdisk0,hda,w', ]
vif=[ 'bridge=br0,model=rtl8139,type=netfront' ]
vfb = [ 'type=vnc,vncunused=1,keymap=de' ]
on_crash="preserve"
Jan Beulich
2013-Feb-22 14:01 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> It did not happen with xl.

But the same guest and Dom0 kernel, and the same hypervisor?

> Here is the output while doing xm migrate:
>
> (XEN) HVM2 restore: VMCE_VCPU 0
> (XEN) HVM2 restore: VMCE_VCPU 1
> (XEN) HVM2 restore: TSC_ADJUST 0
> (XEN) HVM2 restore: TSC_ADJUST 1
> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001

Didn't even notice yesterday that this is apparently after restore has already started. Which makes me curious whether the domain that is being referenced with rd= is the old or the new one (would require printing the domain ID; honestly I never understood what use printing of the domain pointer is).

I'm also confused by the domain pointer always being the same; I would expect it to at least toggle between two values, but probably even be different between every instance of the guest. But you're not having a stubdom configured for the guest either, according to the config you sent earlier...

> (XEN) Xen call trace:
> (XEN) [<ffff82c4c0170fb2>] get_page+0xfb/0x151
> (XEN) [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
> (XEN) [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
> (XEN) [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
> (XEN) [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
> (XEN) [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
> (XEN) [<ffff82c4c011502f>] do_multicall+0x227/0x444
> (XEN) [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145

The only user of grant copies is netback, and hence I would suppose that the failed transmit (in whichever direction) is simply being retried, thus preventing the error from becoming user visible.

Jan
Olaf Hering
2013-Feb-22 20:07 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
On Fri, Feb 22, Jan Beulich wrote:

> >>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> > It did not happen with xl.
>
> But the same guest and Dom0 kernel, and the same hypervisor?

Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.

> > Here is the output while doing xm migrate:
> >
> > (XEN) HVM2 restore: VMCE_VCPU 0
> > (XEN) HVM2 restore: VMCE_VCPU 1
> > (XEN) HVM2 restore: TSC_ADJUST 0
> > (XEN) HVM2 restore: TSC_ADJUST 1
> > (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
>
> Didn't even notice yesterday that this is apparently after restore
> has already started. Which makes me curious whether the domain
> that is being referenced with rd= is the old or the new one (would
> require printing the domain ID; honestly I never understood what
> use printing of the domain pointer is).
>
> I'm also confused by the domain pointer always being the same;
> I would expect it to at least toggle between two values, but
> probably even be different between every instance of the guest.
> But you're not having a stubdom configured for the guest either,
> according to the config you sent earlier...

The rd->domain_id is DOMID_COW in both cases.

Olaf
Jan Beulich
2013-Feb-25 09:34 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
>>> On 22.02.13 at 21:07, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 22, Jan Beulich wrote:
>
>> >>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
>> > It did not happen with xl.
>>
>> But the same guest and Dom0 kernel, and the same hypervisor?
>
> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
>
>> > Here is the output while doing xm migrate:
>> >
>> > (XEN) HVM2 restore: VMCE_VCPU 0
>> > (XEN) HVM2 restore: VMCE_VCPU 1
>> > (XEN) HVM2 restore: TSC_ADJUST 0
>> > (XEN) HVM2 restore: TSC_ADJUST 1
>> > (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
>>
>> Didn't even notice yesterday that this is apparently after restore
>> has already started. Which makes me curious whether the domain
>> that is being referenced with rd= is the old or the new one (would
>> require printing the domain ID; honestly I never understood what
>> use printing of the domain pointer is).
>>
>> I'm also confused by the domain pointer always being the same;
>> I would expect it to at least toggle between two values, but
>> probably even be different between every instance of the guest.
>> But you're not having a stubdom configured for the guest either,
>> according to the config you sent earlier...
>
> The rd->domain_id is DOMID_COW in both cases.

Which suggests that memory sharing is in use. At least I'm unaware of other uses of that pseudo domain.

Jan
Andres Lagar-Cavilla
2013-Feb-25 14:52 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
>>>> On 22.02.13 at 21:07, Olaf Hering <olaf@aepfle.de> wrote:
>> On Fri, Feb 22, Jan Beulich wrote:
>>
>>>>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
>>>> It did not happen with xl.
>>>
>>> But the same guest and Dom0 kernel, and the same hypervisor?
>>
>> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
>>
>>>> Here is the output while doing xm migrate:
>>>>
>>>> (XEN) HVM2 restore: VMCE_VCPU 0
>>>> (XEN) HVM2 restore: VMCE_VCPU 1
>>>> (XEN) HVM2 restore: TSC_ADJUST 0
>>>> (XEN) HVM2 restore: TSC_ADJUST 1
>>>> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
>>>
>>> Didn't even notice yesterday that this is apparently after restore
>>> has already started. Which makes me curious whether the domain
>>> that is being referenced with rd= is the old or the new one (would
>>> require printing the domain ID; honestly I never understood what
>>> use printing of the domain pointer is).
>>>
>>> I'm also confused by the domain pointer always being the same;
>>> I would expect it to at least toggle between two values, but
>>> probably even be different between every instance of the guest.
>>> But you're not having a stubdom configured for the guest either,
>>> according to the config you sent earlier...
>>
>> The rd->domain_id is DOMID_COW in both cases.
>
> Which suggests that memory sharing is in use. At least I'm unaware
> of other uses of that pseudo domain.

There are none.

There seems to be something else amiss though. Unless I am parsing this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT | PCD? Looks like a very unlikely combination

Andres

>
> Jan
At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
> >>>> On 22.02.13 at 21:07, Olaf Hering <olaf@aepfle.de> wrote:
> >> On Fri, Feb 22, Jan Beulich wrote:
> >>
> >>>>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> >>>> It did not happen with xl.
> >>>
> >>> But the same guest and Dom0 kernel, and the same hypervisor?
> >>
> >> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
> >>
> >>>> Here is the output while doing xm migrate:
> >>>>
> >>>> (XEN) HVM2 restore: VMCE_VCPU 0
> >>>> (XEN) HVM2 restore: VMCE_VCPU 1
> >>>> (XEN) HVM2 restore: TSC_ADJUST 0
> >>>> (XEN) HVM2 restore: TSC_ADJUST 1
> >>>> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
> >>>
> >>> Didn't even notice yesterday that this is apparently after restore
> >>> has already started. Which makes me curious whether the domain
> >>> that is being referenced with rd= is the old or the new one (would
> >>> require printing the domain ID; honestly I never understood what
> >>> use printing of the domain pointer is).
> >>>
> >>> I'm also confused by the domain pointer always being the same;
> >>> I would expect it to at least toggle between two values, but
> >>> probably even be different between every instance of the guest.
> >>> But you're not having a stubdom configured for the guest either,
> >>> according to the config you sent earlier...
> >>
> >> The rd->domain_id is DOMID_COW in both cases.
> >
> > Which suggests that memory sharing is in use. At least I'm unaware
> > of other uses of that pseudo domain.
>
> There are none.
>
> There seems to be something else amiss though. Unless I am parsing
> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
> | PCD? Looks like a very unlikely combination

By my reading,

taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
caf = 0x0180000000000000 = refcount 0, PGC_state_free

iow this is a free page but somehow has ended up with a typecount (which explains why the get_page() failed). And presumably this is one of the various get_page[_and_type](page, dom_cow) calls in mem_sharing.c.

Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like something's gone badly off the rails here.

One place I can see that tinkers with typecount without holding a ref is share_xen_page_with_guest(), which sets exactly this typecount, but then calls page_set_owner(page, d).

There's some hairy code in __gnttab_map_grant_ref() too, but I _think_ it can't end up taking typecounts without refcounts.

__acquire_grant_for_copy() looks pretty hairy too, in particular this:
    (void)page_get_owner_and_reference(*page);
but presumably the matching put_page() would have crashed if that was the problem. Does anyone understand the grant code well enough to get into that?

If you can repro this, it might be worth tracing all the refcount ops into a large buffer and dumping the history of this MFN on failure.

Cheers,

Tim.
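
Tim's reading of the two fields can be checked mechanically. The sketch below is a standalone decoder, not code from the Xen tree: the upper-case macro names are local to this example, and the bit positions (flags counted down from bit 63, count in the low 55 bits, type in the top four bits, "validated" at bit 58, page state at bits 55-56) are an assumption chosen to match the decoding given above. The authoritative definitions are in the x86 `mm.h` header of the tree being debugged and should be checked there.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: flag positions are counted down from the top bit. */
#define PG_SHIFT(idx)    (64 - (idx))
#define PG_MASK(x, idx)  ((uint64_t)(x) << PG_SHIFT(idx))

/* type_info (printed as taf) */
#define PGT_TYPE_MASK      PG_MASK(15, 4)   /* top four bits: page type  */
#define PGT_WRITABLE_PAGE  PG_MASK(7, 4)
#define PGT_VALIDATED      PG_MASK(1, 6)
#define PGT_COUNT_MASK     ((UINT64_C(1) << PG_SHIFT(9)) - 1)

/* count_info (printed as caf) */
#define PGC_STATE          PG_MASK(3, 9)    /* two-bit page state field  */
#define PGC_STATE_FREE     PG_MASK(3, 9)
#define PGC_COUNT_MASK     ((UINT64_C(1) << PG_SHIFT(9)) - 1)

uint64_t type_count(uint64_t taf) { return taf & PGT_COUNT_MASK; }
uint64_t ref_count(uint64_t caf)  { return caf & PGC_COUNT_MASK; }

int is_writable_type(uint64_t taf)
{
    return (taf & PGT_TYPE_MASK) == PGT_WRITABLE_PAGE;
}

int is_validated(uint64_t taf)  { return (taf & PGT_VALIDATED) != 0; }
int is_state_free(uint64_t caf) { return (caf & PGC_STATE) == PGC_STATE_FREE; }

/* Print one line per (caf, taf) pair in the style of Tim's summary. */
void describe(uint64_t caf, uint64_t taf)
{
    printf("caf=%016" PRIx64 ": refcount %" PRIu64 "%s\n",
           caf, ref_count(caf), is_state_free(caf) ? ", state free" : "");
    printf("taf=%016" PRIx64 ": typecount %" PRIu64 "%s%s\n",
           taf, type_count(taf),
           is_writable_type(taf) ? ", writable" : "",
           is_validated(taf) ? ", validated" : "");
}
```

Calling `describe(0x0180000000000000, 0x7400000000000001)` and `describe(0x0180000000000000, 0x0000000000000001)` covers the two variants from the log; under the assumed layout the second one is a type count of 1 with no type bits set at all.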
Olaf Hering
2013-Feb-28 13:12 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
On Thu, Feb 28, Tim Deegan wrote:> If you can repro this, it might be worth tracing all the refcount ops > into a large buffer and dumping the history of this MFN on failure.I can reproduce it with xend, will have a look next week. Olaf
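
The tracing Tim suggests could look roughly like the following. This is a minimal userspace sketch under stated assumptions, not hypervisor code: `trace_ref_op`, `trace_history`, `struct ref_event` and `TRACE_SLOTS` are invented names, and the buffer is a single global ring with no locking, so a real version would need per-CPU buffers or a lock around the update.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 1024   /* arbitrary size; older events are overwritten */

enum ref_op { OP_GET_PAGE, OP_PUT_PAGE, OP_GET_TYPE, OP_PUT_TYPE };

struct ref_event {
    uint64_t    mfn;       /* frame the operation was applied to   */
    enum ref_op op;        /* which refcount/typecount op ran      */
    uint64_t    caf, taf;  /* count_info / type_info after the op  */
};

static struct ref_event trace_buf[TRACE_SLOTS];
static size_t trace_total;  /* number of events recorded so far */

/* Record one operation; call this from every get/put path. */
void trace_ref_op(uint64_t mfn, enum ref_op op, uint64_t caf, uint64_t taf)
{
    struct ref_event *e = &trace_buf[trace_total % TRACE_SLOTS];

    e->mfn = mfn;
    e->op  = op;
    e->caf = caf;
    e->taf = taf;
    trace_total++;
}

/*
 * Copy the surviving history of one MFN into out[], oldest first.
 * Returns the number of events copied (at most max).
 */
size_t trace_history(uint64_t mfn, struct ref_event *out, size_t max)
{
    size_t kept  = trace_total < TRACE_SLOTS ? trace_total : TRACE_SLOTS;
    size_t first = trace_total - kept;
    size_t n = 0;

    for (size_t i = first; i < trace_total && n < max; i++) {
        const struct ref_event *e = &trace_buf[i % TRACE_SLOTS];

        if (e->mfn == mfn)
            out[n++] = *e;
    }
    return n;
}
```

On a failing `get_page()` the error path would call `trace_history()` with the pfn from the message and print the returned events; whether 1024 slots is enough to still hold the interesting operations by then is something only an actual run can tell.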
Jan Beulich
2013-Feb-28 14:06 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
>>> On 28.02.13 at 12:56, Tim Deegan <tim@xen.org> wrote:
> At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
>> There seems to be something else amiss though. Unless I am parsing
>> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
>> | PCD? Looks like a very unlikely combination
>
> By my reading,
>
> taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
> caf = 0x0180000000000000 = refcount 0, PGC_state_free

Right.

> iow this is a free page but somehow has ended up with a typecount (which
> explains why the get_page() failed). And presumably this is one of the
> various get_page[_and_type](page, dom_cow) calls in mem_sharing.c.
>
> Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like
> something's gone badly off the rails here.
>
> One place I can see that tinkers with typecount without holding a
> ref is share_xen_page_with_guest(), which sets exactly this typecount,
> but then calls page_set_owner(page, d).
>
> There's some hairy code in __gnttab_map_grant_ref() too, but I _think_
> it can't end up taking typecounts without refcounts.
>
> __acquire_grant_for_copy() looks pretty hairy too, in particular this:
>     (void)page_get_owner_and_reference(*page);
> but presumably the matching put_page() would have crashed if that was
> the problem. Does anyone understand the grant code well enough to get
> into that?

Problem is that the domain reported in the message is DOM_COW according to Olaf, yet he's not knowingly using page sharing. But that fact pretty much excluded the grant table or any other of the "usual" code paths for me (and I never looked at the sharing code, so hoped one of you two would).

Jan
At 14:06 +0000 on 28 Feb (1362060411), Jan Beulich wrote:
> Problem is that the domain reported in the message is DOM_COW
> according to Olaf, yet he's not knowingly using page sharing.

Oh. In that case, a backtrace (e.g. from WARN()) in the offending call would be enlightening.

> But that fact pretty much excluded the grant table or any other of the
> "usual" code paths for me (and I never looked at the sharing code, so
> hoped one of you two would).

I just looked through it for unguarded typecount modifications and didn't find one. But presumably something else is wrong if we're trying to unshare a page with no sharing enabled.

Tim.
Andres Lagar-Cavilla
2013-Feb-28 14:59 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
On Feb 28, 2013, at 9:06 AM, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 28.02.13 at 12:56, Tim Deegan <tim@xen.org> wrote:
>> At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
>>> There seems to be something else amiss though. Unless I am parsing
>>> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
>>> | PCD? Looks like a very unlikely combination
>>
>> By my reading,
>>
>> taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
>> caf = 0x0180000000000000 = refcount 0, PGC_state_free
>
> Right.

D'oh :)

Sharing code never sets PGT_writable, only sets/clears PGT_shared. I tend to think the dom_cow rd belies a different problem. To use dom_cow, the domain has to have an explicit domctl that enables memory sharing.

Andres

>
>> iow this is a free page but somehow has ended up with a typecount (which
>> explains why the get_page() failed). And presumably this is one of the
>> various get_page[_and_type](page, dom_cow) calls in mem_sharing.c.
>>
>> Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like
>> something's gone badly off the rails here.
>>
>> One place I can see that tinkers with typecount without holding a
>> ref is share_xen_page_with_guest(), which sets exactly this typecount,
>> but then calls page_set_owner(page, d).
>>
>> There's some hairy code in __gnttab_map_grant_ref() too, but I _think_
>> it can't end up taking typecounts without refcounts.
>>
>> __acquire_grant_for_copy() looks pretty hairy too, in particular this:
>>     (void)page_get_owner_and_reference(*page);
>> but presumably the matching put_page() would have crashed if that was
>> the problem. Does anyone understand the grant code well enough to get
>> into that?
>
> Problem is that the domain reported in the message is DOM_COW
> according to Olaf, yet he's not knowingly using page sharing. But
> that fact pretty much excluded the grant table or any other of the
> "usual" code paths for me (and I never looked at the sharing code,
> so hoped one of you two would).
>
> Jan
>
Jan Beulich
2013-Feb-28 15:15 UTC
Re: error in xen/arch/x86/mm.c:get_page during migration
>>> On 28.02.13 at 15:59, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
> On Feb 28, 2013, at 9:06 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>>> On 28.02.13 at 12:56, Tim Deegan <tim@xen.org> wrote:
>>> At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
>>>> There seems to be something else amiss though. Unless I am parsing
>>>> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
>>>> | PCD? Looks like a very unlikely combination
>>>
>>> By my reading,
>>>
>>> taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
>>> caf = 0x0180000000000000 = refcount 0, PGC_state_free
>>
>> Right.
> D'oh :)
>
> Sharing code never sets PGT_writable, only sets/clears PGT_shared. I tend to
> think the dom_cow rd belies a different problem. To use dom_cow, the domain
> has to have an explicit domctl that enables memory sharing.

In that case, Olaf, could you simply set dom_cow to some invalid pointer right after it got set up through domain_create(), so that any dereference of it would blow up? That might be the faster path towards finding the first bogus use.

Jan