thr3ads.net - Xen devel - error in xen/arch/x86/mm.c:get

Olaf Hering

2013-Feb-21 14:48 UTC

error in xen/arch/x86/mm.c:get_page during migration

While doing "while xm migrate --live domU localhost;do sleep 2;done" I
see many errors from get_page:

...
(XEN) HVM56 restore: TSC_ADJUST 0
(XEN) HVM56 restore: TSC_ADJUST 1
(XEN) mm.c:1982:d0 Error pfn 41a863: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41be1c: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41a862: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41b90f: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41b49a: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41b48d: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) irq.c:375: Dom56 callback via changed to Direct Vector 0xf3
(XEN) HVM56 save: CPU
...

The pfn numbers and the number of affected pfns differ between iterations,
but in the end only these two variants appear:

# xm dmesg | grep -w mm | cut -d : -f 4- | sort | uniq -c | sort
     22  rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000,
taf=7400000000000001
     46  rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000,
taf=0000000000000001
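
For readers who want to tally the variants from a saved console log rather than
through the shell pipeline, here is a minimal Python sketch. It assumes each
error has been joined back onto one line (the wrapping above comes from the
mail client). The regular expression is modelled only on the lines quoted in
this thread, and the name tally_variants is hypothetical. The sample entry with
taf=0000000000000001 is taken from a later message in this thread.

```python
import re
from collections import Counter

# Sample console lines copied from this thread, with the wrapped
# "caf=..., taf=..." continuation joined back onto each entry.
LOG = """\
(XEN) HVM56 restore: TSC_ADJUST 1
(XEN) mm.c:1982:d0 Error pfn 41a863: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1982:d0 Error pfn 41be1c: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1983:d0 Error pfn 43f7d4: rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, taf=0000000000000001
(XEN) irq.c:375: Dom56 callback via changed to Direct Vector 0xf3
"""

# Assumption: this pattern matches the get_page error format as quoted here.
ERROR_RE = re.compile(
    r"mm\.c:\d+:d\d+ Error pfn (?P<pfn>[0-9a-f]+): "
    r"rd=(?P<rd>[0-9a-f]+), od=(?P<od>[0-9a-f]+), "
    r"caf=(?P<caf>[0-9a-f]+), taf=(?P<taf>[0-9a-f]+)"
)


def tally_variants(text):
    """Count get_page errors grouped by every field except the pfn."""
    counts = Counter()
    for match in ERROR_RE.finditer(text):
        key = (match["rd"], match["od"], match["caf"], match["taf"])
        counts[key] += 1
    return counts


variants = tally_variants(LOG)
# Print the least frequent variant first, like "sort | uniq -c | sort".
for (rd, od, caf, taf), n in sorted(variants.items(), key=lambda kv: kv[1]):
    print(f"{n:7d}  rd={rd}, od={od}, caf={caf}, taf={taf}")
```

Grouping on the parsed fields instead of on a cut-out substring means a change
in the mm.c line number (1982 vs. 1983 in this thread) does not split a variant.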


It does not seem to cause issues other than the log output.
Does it indicate a real bug?

Olaf

Jan Beulich

2013-Feb-21 16:42 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

>>> On 21.02.13 at 15:48, Olaf Hering <olaf@aepfle.de> wrote:
> While doing "while xm migrate --live domU localhost;do sleep 2;done" I
> see many errors from get_page:
> 
> ...
> (XEN) HVM56 restore: TSC_ADJUST 0
> (XEN) HVM56 restore: TSC_ADJUST 1
> (XEN) mm.c:1982:d0 Error pfn 41a863: rd=ffff83036ffef000, 
> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41be1c: rd=ffff83036ffef000, 
> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41a862: rd=ffff83036ffef000, 
> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41b90f: rd=ffff83036ffef000, 
> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41b49a: rd=ffff83036ffef000, 
> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) mm.c:1982:d0 Error pfn 41b48d: rd=ffff83036ffef000, 
> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> (XEN) irq.c:375: Dom56 callback via changed to Direct Vector 0xf3
> (XEN) HVM56 save: CPU
> ...
> 
> The pfn numbers and the number of affected pfns differ between iterations,
> but in the end only these two variants appear:
> 
> # xm dmesg | grep -w mm | cut -d : -f 4- | sort | uniq -c | sort
>      22  rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, 
> taf=7400000000000001
>      46  rd=ffff83036ffef000, od=0000000000000000, caf=180000000000000, 
> taf=0000000000000001
> 
> 
> It does not seem to cause issues other than the log output.
> Does it indicate a real bug?

I'm afraid it does - a non-zero type count should generally not be
accompanied by a zero general count. That's specifically because
lone put_page_type() calls are pretty rare, and going through all
of them I don't see any that could be the one outstanding
in your case.

I'm surprised this doesn't cause an assertion to trigger
somewhere.
You are using a debug hypervisor, aren't you?

Of course, if this truly is just a "leaked" type reference, then there
are no other bad consequences to be afraid of.

What you could do to get a better understanding of when this
happens is to add a WARN_ON() alongside the printk() (perhaps
such that it triggers only once for each of the two different
cases), and then let us look at the call trace.

Jan

Olaf Hering

2013-Feb-21 17:31 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

On Thu, Feb 21, Jan Beulich wrote:
> What you could do to get a better understanding of when this
> happens is to add a WARN_ON() alongside the printk() (perhaps
> such that it triggers only once for each of the two different
> cases), and then let us look at the call trace.

It did not happen with xl.

Here is the output while doing xm migrate:


(XEN) HVM2 restore: VMCE_VCPU 0
(XEN) HVM2 restore: VMCE_VCPU 1
(XEN) HVM2 restore: TSC_ADJUST 0
(XEN) HVM2 restore: TSC_ADJUST 1
(XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) Xen WARN at mm.c:1986
(XEN) ----[ Xen-4.3.26579-20130221.171413  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    20
(XEN) RIP:    e008:[<ffff82c4c0170fb2>] get_page+0xfb/0x151
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: 7400000000000001   rbx: 7400000000000001   rcx: 0000000000000000
(XEN) rdx: 7400000000000001   rsi: 000000000000000a   rdi: ffff82c4c0280748
(XEN) rbp: ffff83036d5f7958   rsp: ffff83036d5f7908   r8:  0000000000000014
(XEN) r9:  0000000000000004   r10: 0000000000000004   r11: 0000000000000001
(XEN) r12: 0180000000000000   r13: 0000000000000000   r14: ffff83036ffef000
(XEN) r15: ffff82e0082258a0   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000065e78f000   cr2: ffff8805ad260040
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83036d5f7908:
(XEN)    0000000000000000 0180000000000000 7400000000000001 ffff82c4c01e1726
(XEN)    00000000004112c5 ffff82e0082258a0 ffff83036d5f79fc ffff83036d5f799c
(XEN)    ffff830232aa49e0 ffff830232aa49e0 ffff83036d5f79c8 ffff82c4c01e1d87
(XEN)    ffff830300000000 0000000000000086 000000026d5f7a04 00000000000190c5
(XEN)    ffff830402126000 ffffffff01000086 000000076d5f7bc0 0000000000000000
(XEN)    ffff830402126000 ffff83040225dc60 00000000000190c5 ffff83036d5f7ba0
(XEN)    ffff83036d5f7a28 ffff82c4c01098ae ffff83036d5f7a98 ffff83036d5f7ba0
(XEN)    ffff83036d5f7ab8 00000000000f03f8 000000006d5f7a08 0000000000000000
(XEN)    0000000000000240 ffff83040225dc60 0000000000000000 ffff83036d5f7ba0
(XEN)    ffff83036d5f7ae8 ffff82c4c0109e55 ffff830402296200 0000000000000086
(XEN)    0000018300000009 00000000000000fd ffff83036d5f7bb0 000082e000000000
(XEN)    ffff830402126000 ffff830402203c58 00000000002337b6 ffff830402296200
(XEN)    0000000000000000 ffff830402296200 ffff830402296200 000002406d5f7ba8
(XEN)    ffff830300000000 ffff83036d5f7bb0 000000006d5f7b68 0000000000000000
(XEN)    ffff83036d5ce000 0000000000000000 0000000000000000 ffff830402126000
(XEN)    ffff83036d5f7c28 ffff82c4c010bef0 ffff83036d5f7bc4 ffff83036d5f7bc0
(XEN)    ffff830300000001 0000000000000096 ffff83036d5f7bec ffff82c4c0319820
(XEN)    ffff83036d5f0000 ffff83036d5f0000 ffff83036d5f0000 ffff83036d5f0000
(XEN)    ffff83036d5f0000 ffff83036d5f7bc8 ffff83036d5f0000 0000000000000001
(XEN)    ffffc90010283a40 0000000000000002 ffff83036d5f7bd8 ffff82c4c0125aa4
(XEN) Xen call trace:
(XEN)    [<ffff82c4c0170fb2>] get_page+0xfb/0x151
(XEN)    [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
(XEN)    [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
(XEN)    [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
(XEN)    [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
(XEN)    [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
(XEN)    [<ffff82c4c011502f>] do_multicall+0x227/0x444
(XEN)    [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145
(XEN)
(XEN) mm.c:1983:d0 Error pfn 41144d: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1983:d0 Error pfn 4116b0: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) irq.c:375: Dom2 callback via changed to Direct Vector 0xf3
(XEN) HVM2 save: CPU
...
(XEN) HVM3 restore: VMCE_VCPU 0
(XEN) HVM3 restore: VMCE_VCPU 1
(XEN) HVM3 restore: TSC_ADJUST 0
(XEN) HVM3 restore: TSC_ADJUST 1
(XEN) mm.c:1983:d0 Error pfn 43f7d4: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=0000000000000001
(XEN) Xen WARN at mm.c:1990
(XEN) ----[ Xen-4.3.26579-20130221.171413  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    14
(XEN) RIP:    e008:[<ffff82c4c0170fdc>] get_page+0x125/0x151
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: 7400000000000001   rbx: 0000000000000001   rcx: 0000000000000000
(XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: ffff82c4c0280748
(XEN) rbp: ffff83036ff2f958   rsp: ffff83036ff2f908   r8:  000000000000000e
(XEN) r9:  0000000000000004   r10: 0000000000000004   r11: 0000000000000001
(XEN) r12: 0180000000000000   r13: 0000000000000000   r14: ffff83036ffef000
(XEN) r15: ffff82e0087efa80   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000230871000   cr2: ffff8805abd77f20
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83036ff2f908:
(XEN)    0000000000000000 0180000000000000 0000000000000001 ffff82c4c01e1726
(XEN)    000000000043f7d4 ffff82e0087efa80 ffff83036ff2f9fc ffff83036ff2f99c
(XEN)    ffff830231fa8010 ffff830231fa8010 ffff83036ff2f9c8 ffff82c4c01e1d87
(XEN)    ffff830300000000 0000000000000086 00000002900d0604 0000000000019110
(XEN)    ffff83022fb64000 ffffffff01000048 000000076ff2fa18 0000000000000000
(XEN)    ffff83022fb64000 ffff830232319db0 0000000000019110 ffff83036ff2fba0
(XEN)    ffff83036ff2fa28 ffff82c4c01098ae ffff83036ff2fa58 ffff83036ff2fba0
(XEN)    ffff83036ff2fab8 0000000000000004 0000000000000000 0000000000000000
(XEN)    0000000000000247 ffff830232319db0 0000000000000000 ffff83036ff2fba0
(XEN)    ffff83036ff2fae8 ffff82c4c0109e55 ffff83022f019238 0000000000000000
(XEN)    ffff83036ff2fad8 0000000000000004 ffff83036ff2fbb0 000082e000000000
(XEN)    ffff83022fb64000 ffff8302325a0c58 00000000002338d7 ffff83022f019238
(XEN)    0000000000000000 ffff83022f019238 ffff83022f019238 000002476ff2fba8
(XEN)    ffff830300000000 ffff8302325a0c58 000000006d5ce000 0000000000000000
(XEN)    ffff83036d5ce000 0000000000000000 0000000000000000 ffff83022fb64000
(XEN)    ffff83036ff2fc28 ffff82c4c010bef0 ffff83036ff2fbc4 ffff83036ff2fbc0
(XEN)    ffff830300000001 0000000000000096 ffff83036ff2fbec ffff82c4c0319820
(XEN)    ffff83036ff28000 ffff83036ff28000 ffff83036ff28000 ffff83036ff28000
(XEN)    ffff83036ff28000 ffff83036ff2fbc8 ffff83036ff28000 0000000000000001
(XEN)    ffffc90010283a40 0000000000000002 ffff83036ff2fbd8 ffff82c4c0125aa4
(XEN) Xen call trace:
(XEN)    [<ffff82c4c0170fdc>] get_page+0x125/0x151
(XEN)    [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
(XEN)    [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
(XEN)    [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
(XEN)    [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
(XEN)    [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
(XEN)    [<ffff82c4c011502f>] do_multicall+0x227/0x444
(XEN)    [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145
(XEN)
(XEN) mm.c:1983:d0 Error pfn 43e646: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=0000000000000001
(XEN) mm.c:1983:d0 Error pfn 43f86a: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=0000000000000001
(XEN) mm.c:1983:d0 Error pfn 43e683: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=0000000000000001
(XEN) mm.c:1983:d0 Error pfn 43f31b: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=0000000000000001
(XEN) mm.c:1983:d0 Error pfn 43e5f0: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=0000000000000001
(XEN) mm.c:1983:d0 Error pfn 43f3b7: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) mm.c:1983:d0 Error pfn 43f87c: rd=ffff83036ffef000, od=0000000000000000,
caf=180000000000000, taf=7400000000000001
(XEN) irq.c:375: Dom3 callback via changed to Direct Vector 0xf3
...

Olaf

Jan Beulich

2013-Feb-22 07:42 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> On Thu, Feb 21, Jan Beulich wrote:
> 
>> What you could do to get a better understanding of when this
>> happens is to add a WARN_ON() alongside the printk() (perhaps
>> such that it triggers only once for each of the two different
>> cases), and then let us look at the call trace.
> 
> It did not happen with xl.

Odd.

> Here is the output while doing xm migrate:
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4c0170fb2>] get_page+0xfb/0x151
> (XEN)    [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
> (XEN)    [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
> (XEN)    [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
> (XEN)    [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
> (XEN)    [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
> (XEN)    [<ffff82c4c011502f>] do_multicall+0x227/0x444
> (XEN)    [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4c0170fdc>] get_page+0x125/0x151
> (XEN)    [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
> (XEN)    [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
> (XEN)    [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
> (XEN)    [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
> (XEN)    [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
> (XEN)    [<ffff82c4c011502f>] do_multicall+0x227/0x444
> (XEN)    [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145

And that's without paging or anything similar? Else I would expect
the problem to be in the interaction there. But I'll also take a look
at the grant table code assuming this is a problem even without
any enhanced memory management functionality...

Jan

Olaf Hering

2013-Feb-22 08:57 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

On Fri, Feb 22, Jan Beulich wrote:
> And that's without paging or anything similar? Else I would expect
> the problem to be in the interaction there. But I'll also take a look
> at the grant table code assuming this is a problem even without
> any enhanced memory management functionality...

Nothing like this is enabled.


name="domU"
description="something"
uuid="a062cabb-5981-4472-9d3b-da7bd8e2594e"
memory=512
vcpus=2
serial="pty"
builder="hvm"
boot="dcn"
disk=[
        'file:/some/vdisk0,hda,w',
]
vif=[
        'bridge=br0,model=rtl8139,type=netfront'
]
vfb = [
        'type=vnc,vncunused=1,keymap=de'
]
on_crash="preserve"

Jan Beulich

2013-Feb-22 14:01 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> It did not happen with xl.

But the same guest and Dom0 kernel, and the same hypervisor?

> Here is the output while doing xm migrate:
> 
> (XEN) HVM2 restore: VMCE_VCPU 0
> (XEN) HVM2 restore: VMCE_VCPU 1
> (XEN) HVM2 restore: TSC_ADJUST 0
> (XEN) HVM2 restore: TSC_ADJUST 1
> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000,
> od=0000000000000000, caf=180000000000000, taf=7400000000000001

Didn't even notice yesterday that this is apparently after restore
has already started. Which makes me curious whether the domain
that is being referenced with rd= is the old or the new one (would
require printing the domain ID; honestly I never understood what
use printing of the domain pointer is).

I'm also confused by the domain pointer always being the same;
I would expect it to at least toggle between two values, but
probably even be different between every instance of the guest.
But you're not having a stubdom configured for the guest either,
according to the config you sent earlier...

> (XEN) Xen call trace:
> (XEN)    [<ffff82c4c0170fb2>] get_page+0xfb/0x151
> (XEN)    [<ffff82c4c01e1d87>] get_page_from_gfn_p2m+0x17e/0x284
> (XEN)    [<ffff82c4c01098ae>] __get_paged_frame+0x5d/0x170
> (XEN)    [<ffff82c4c0109e55>] __acquire_grant_for_copy+0x494/0x6ae
> (XEN)    [<ffff82c4c010bef0>] gnttab_copy+0x53b/0x843
> (XEN)    [<ffff82c4c010e3b8>] do_grant_table_op+0x11c5/0x1b82
> (XEN)    [<ffff82c4c011502f>] do_multicall+0x227/0x444
> (XEN)    [<ffff82c4c0227f0b>] syscall_enter+0xeb/0x145

The only user of grant copies is netback, and hence I would suppose
that the failed transmit (in whichever direction) is simply being retried,
thus preventing the error from becoming user visible.

Jan

Olaf Hering

2013-Feb-22 20:07 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

On Fri, Feb 22, Jan Beulich wrote:
> >>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> > It did not happen with xl.
> 
> But the same guest and Dom0 kernel, and the same hypervisor?

Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.

> > Here is the output while doing xm migrate:
> > 
> > (XEN) HVM2 restore: VMCE_VCPU 0
> > (XEN) HVM2 restore: VMCE_VCPU 1
> > (XEN) HVM2 restore: TSC_ADJUST 0
> > (XEN) HVM2 restore: TSC_ADJUST 1
> > (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000,
> > od=0000000000000000, caf=180000000000000, taf=7400000000000001
> 
> Didn't even notice yesterday that this is apparently after restore
> has already started. Which makes me curious whether the domain
> that is being referenced with rd= is the old or the new one (would
> require printing the domain ID; honestly I never understood what
> use printing of the domain pointer is).
> 
> I'm also confused by the domain pointer always being the same;
> I would expect it to at least toggle between two values, but
> probably even be different between every instance of the guest.
> But you're not having a stubdom configured for the guest either,
> according to the config you sent earlier...

The rd->domain_id is DOMID_COW in both cases.


Olaf

Jan Beulich

2013-Feb-25 09:34 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

>>> On 22.02.13 at 21:07, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 22, Jan Beulich wrote:
> 
>> >>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
>> > It did not happen with xl.
>> 
>> But the same guest and Dom0 kernel, and the same hypervisor?
> 
> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
> 
>> > Here is the output while doing xm migrate:
>> > 
>> > (XEN) HVM2 restore: VMCE_VCPU 0
>> > (XEN) HVM2 restore: VMCE_VCPU 1
>> > (XEN) HVM2 restore: TSC_ADJUST 0
>> > (XEN) HVM2 restore: TSC_ADJUST 1
>> > (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000,
>> > od=0000000000000000, caf=180000000000000, taf=7400000000000001
>> 
>> Didn't even notice yesterday that this is apparently after restore
>> has already started. Which makes me curious whether the domain
>> that is being referenced with rd= is the old or the new one (would
>> require printing the domain ID; honestly I never understood what
>> use printing of the domain pointer is).
>> 
>> I'm also confused by the domain pointer always being the same;
>> I would expect it to at least toggle between two values, but
>> probably even be different between every instance of the guest.
>> But you're not having a stubdom configured for the guest either,
>> according to the config you sent earlier...
> 
> The rd->domain_id is DOMID_COW in both cases.

Which suggests that memory sharing is in use. At least I'm unaware
of other uses of that pseudo domain.

Jan

Andres Lagar-Cavilla

2013-Feb-25 14:52 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

>>>> On 22.02.13 at 21:07, Olaf Hering <olaf@aepfle.de> wrote:
>> On Fri, Feb 22, Jan Beulich wrote:
>> 
>>>>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
>>>> It did not happen with xl.
>>> 
>>> But the same guest and Dom0 kernel, and the same hypervisor?
>> 
>> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
>> 
>>>> Here is the output while doing xm migrate:
>>>> 
>>>> (XEN) HVM2 restore: VMCE_VCPU 0
>>>> (XEN) HVM2 restore: VMCE_VCPU 1
>>>> (XEN) HVM2 restore: TSC_ADJUST 0
>>>> (XEN) HVM2 restore: TSC_ADJUST 1
>>>> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000,
>>>> od=0000000000000000, caf=180000000000000, taf=7400000000000001
>>> 
>>> Didn't even notice yesterday that this is apparently after restore
>>> has already started. Which makes me curious whether the domain
>>> that is being referenced with rd= is the old or the new one (would
>>> require printing the domain ID; honestly I never understood what
>>> use printing of the domain pointer is).
>>> 
>>> I'm also confused by the domain pointer always being the same;
>>> I would expect it to at least toggle between two values, but
>>> probably even be different between every instance of the guest.
>>> But you're not having a stubdom configured for the guest either,
>>> according to the config you sent earlier...
>> 
>> The rd->domain_id is DOMID_COW in both cases.
> 
> Which suggests that memory sharing is in use. At least I'm unaware
> of other uses of that pseudo domain.

There are none.

There seems to be something else amiss though. Unless I am parsing this
incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT | PCD? Looks
like a very unlikely combination.

Andres

> Jan

Tim Deegan

2013-Feb-28 11:56 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
> >>>> On 22.02.13 at 21:07, Olaf Hering <olaf@aepfle.de> wrote:
> >> On Fri, Feb 22, Jan Beulich wrote:
> >> 
> >>>>>> On 21.02.13 at 18:31, Olaf Hering <olaf@aepfle.de> wrote:
> >>>> It did not happen with xl.
> >>> 
> >>> But the same guest and Dom0 kernel, and the same hypervisor?
> >> 
> >> Yes, same sles11sp2 dom0, and 3.7.9 pvops guest.
> >> 
> >>>> Here is the output while doing xm migrate:
> >>>> 
> >>>> (XEN) HVM2 restore: VMCE_VCPU 0
> >>>> (XEN) HVM2 restore: VMCE_VCPU 1
> >>>> (XEN) HVM2 restore: TSC_ADJUST 0
> >>>> (XEN) HVM2 restore: TSC_ADJUST 1
> >>>> (XEN) mm.c:1983:d0 Error pfn 4112c5: rd=ffff83036ffef000,
> >>>> od=0000000000000000, caf=180000000000000, taf=7400000000000001
> >>> 
> >>> Didn't even notice yesterday that this is apparently after restore
> >>> has already started. Which makes me curious whether the domain
> >>> that is being referenced with rd= is the old or the new one (would
> >>> require printing the domain ID; honestly I never understood what
> >>> use printing of the domain pointer is).
> >>> 
> >>> I'm also confused by the domain pointer always being the same;
> >>> I would expect it to at least toggle between two values, but
> >>> probably even be different between every instance of the guest.
> >>> But you're not having a stubdom configured for the guest either,
> >>> according to the config you sent earlier...
> >> 
> >> The rd->domain_id is DOMID_COW in both cases.
> > 
> > Which suggests that memory sharing is in use. At least I'm unaware
> > of other uses of that pseudo domain.
> 
> There are none.
> 
> There seems to be something else amiss though. Unless I am parsing
> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
> | PCD? Looks like a very unlikely combination

By my reading,

taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
caf = 0x0180000000000000 = refcount 0, PGC_state_free

iow this is a free page but somehow has ended up with a typecount (which
explains why the get_page() failed).  And presumably this is one of the
various get_page[_and_type](page, dom_cow) calls in mem_sharing.c.
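
The decoding above can be reproduced with a short Python sketch. The bit
positions are an assumption: they follow the PG_shift()/PG_mask() scheme as
recalled from the x86 mm.h header of that era, and they are chosen to agree
with the reading given here (type in the top four bits, PGT_validated at bit
58, page state in bits 55-56, counts in the low 55 bits). The helper names
decode_taf and decode_caf are hypothetical. Check the header of the tree
actually in use before relying on any of it.

```python
BITS_PER_LONG = 64


def pg_shift(idx):
    """Bit position of a flag that sits idx bits below the top of the word."""
    return BITS_PER_LONG - idx


# Assumed layout, see the note above. Only the type values that occur in
# this thread are named; anything else is reported numerically.
PGT_TYPES = {0: "PGT_none", 7: "PGT_writable_page"}
PGT_VALIDATED = 1 << pg_shift(6)
PGT_COUNT_MASK = (1 << pg_shift(9)) - 1

PGC_STATES = {
    0: "PGC_state_inuse",
    1: "PGC_state_offlining",
    2: "PGC_state_offlined",
    3: "PGC_state_free",
}
PGC_COUNT_MASK = (1 << pg_shift(9)) - 1


def decode_taf(taf):
    """Split type_info into type name, validated flag and type count."""
    type_bits = taf >> pg_shift(4)
    return {
        "type": PGT_TYPES.get(type_bits, f"type {type_bits}"),
        "validated": bool(taf & PGT_VALIDATED),
        "typecount": taf & PGT_COUNT_MASK,
    }


def decode_caf(caf):
    """Split count_info into page state and general reference count."""
    return {
        "state": PGC_STATES[(caf >> pg_shift(9)) & 3],
        "refcount": caf & PGC_COUNT_MASK,
    }


for taf in (0x7400000000000001, 0x0000000000000001):
    print(f"taf={taf:016x}", decode_taf(taf))
print(f"caf={0x0180000000000000:016x}", decode_caf(0x0180000000000000))
```

Under these assumptions both logged variants carry a type count of 1 on a page
whose count_info says it is free with no general references. The second
variant differs only in having no type and no validated bit set.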

Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like
something's gone badly off the rails here.

One place I can see that tinkers with typecount without holding a
ref is share_xen_page_with_guest(), which sets exactly this typecount,
but then calls page_set_owner(page, d).

There's some hairy code in __gnttab_map_grant_ref() too, but I _think_
it can't end up taking typecounts without refcounts.

__acquire_grant_for_copy() looks pretty hairy too, in particular this:
        (void)page_get_owner_and_reference(*page);
but presumably the matching put_page() would have crashed if that was
the problem.  Does anyone understand the grant code well enough to get
into that?

If you can repro this, it might be worth tracing all the refcount ops
into a large buffer and dumping the history of this MFN on failure.
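
To illustrate the idea only, here is a small Python model of such a trace
buffer: a fixed-size ring that records every refcount operation and can replay
the surviving history of one frame when a failure is detected. It is not
hypervisor code. The class RefTrace, its fields, the caller labels and the
recorded sequence are all hypothetical. Only the frame numbers are borrowed
from the log earlier in this thread.

```python
from collections import deque, namedtuple

# One record per refcount operation; "caller" stands in for a return address.
TraceEntry = namedtuple("TraceEntry", "seq mfn op caller")


class RefTrace:
    """Fixed-size ring buffer of refcount operations (illustrative model)."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._buf = deque(maxlen=capacity)
        self._seq = 0

    def record(self, mfn, op, caller):
        self._buf.append(TraceEntry(self._seq, mfn, op, caller))
        self._seq += 1

    def history(self, mfn):
        """Return the surviving entries for one frame, oldest first."""
        return [entry for entry in self._buf if entry.mfn == mfn]


# Made-up sequence of operations, purely to exercise the buffer.
trace = RefTrace(capacity=4)
trace.record(0x4112C5, "get_page_type", "caller_a")
trace.record(0x41144D, "get_page", "caller_b")
trace.record(0x4112C5, "put_page", "caller_c")
trace.record(0x4116B0, "get_page", "caller_b")
trace.record(0x4112C5, "get_page", "caller_d")  # evicts the oldest entry

# On failure, dump what is still known about the offending frame.
for entry in trace.history(0x4112C5):
    print(f"#{entry.seq} mfn={entry.mfn:x} {entry.op} from {entry.caller}")
```

The buffer size is the main trade-off: if it is too small, the operation that
left the stray type count may already have been overwritten by the time
get_page() fails.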

Cheers,

Tim.

Olaf Hering

2013-Feb-28 13:12 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

On Thu, Feb 28, Tim Deegan wrote:
> If you can repro this, it might be worth tracing all the refcount ops
> into a large buffer and dumping the history of this MFN on failure.

I can reproduce it with xend, will have a look next week.

Olaf

Jan Beulich

2013-Feb-28 14:06 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

>>> On 28.02.13 at 12:56, Tim Deegan <tim@xen.org> wrote:
> At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
>> There seems to be something else amiss though. Unless I am parsing
>> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
>> | PCD? Looks like a very unlikely combination
> 
> By my reading, 
> 
> taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
> caf = 0x0180000000000000 = refcount 0, PGC_state_free

Right.

> iow this is a free page but somehow has ended up with a typecount (which
> explains why the get_page() failed).  And presumably this is one of the
> various get_page[_and_type](page, dom_cow) calls in mem_sharing.c.
> 
> Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like
> something's gone badly off the rails here.
> 
> One place I can see that tinkers with typecount without holding a
> ref is share_xen_page_with_guest(), which sets exactly this typecount,
> but then calls page_set_owner(page, d).
> 
> There's some hairy code in __gnttab_map_grant_ref() too, but I _think_
> it can't end up taking typecounts without refcounts.
> 
> __acquire_grant_for_copy() looks pretty hairy too, in particular this:
>         (void)page_get_owner_and_reference(*page);
> but presumably the matching put_page() would have crashed if that was
> the problem.  Does anyone understand the grant code well enough to get
> into that?

Problem is that the domain reported in the message is DOM_COW
according to Olaf, yet he's not knowingly using page sharing. But
that fact pretty much excluded the grant table or any other of the
"usual" code paths for me (and I never looked at the sharing code,
so hoped one of you two would).

Jan

Tim Deegan

2013-Feb-28 14:13 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

At 14:06 +0000 on 28 Feb (1362060411), Jan Beulich wrote:
> Problem is that the domain reported in the message is DOM_COW
> according to Olaf, yet he's not knowingly using page sharing.

Oh.  In that case, a backtrace (e.g. from WARN()) in the offending call
would be enlightening.

> But that fact pretty much excluded the grant table or any other of the
> "usual" code paths for me (and I never looked at the sharing code, so
> hoped one of you two would).

I just looked through it for unguarded typecount modifications and
didn't find one.  But presumably something else is wrong if we're
trying to unshare a page with no sharing enabled.

Tim.

Andres Lagar-Cavilla

2013-Feb-28 14:59 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

On Feb 28, 2013, at 9:06 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 28.02.13 at 12:56, Tim Deegan <tim@xen.org> wrote:
>> At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
>>> There seems to be something else amiss though. Unless I am parsing
>>> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
>>> | PCD? Looks like a very unlikely combination
>> 
>> By my reading, 
>> 
>> taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
>> caf = 0x0180000000000000 = refcount 0, PGC_state_free
> 
> Right.

D'oh :)

Sharing code never sets PGT_writable, only sets/clears PGT_shared. I tend to
think the dom_cow rd belies a different problem. To use dom_cow, the domain has
to have an explicit domctl that enables memory sharing.

Andres

>> iow this is a free page but somehow has ended up with a typecount (which
>> explains why the get_page() failed).  And presumably this is one of the
>> various get_page[_and_type](page, dom_cow) calls in mem_sharing.c.
>> 
>> Since free_domheap_pages() has a BUG_ON(typecount != 0), it seems like
>> something's gone badly off the rails here.
>> 
>> One place I can see that tinkers with typecount without holding a
>> ref is share_xen_page_with_guest(), which sets exactly this typecount,
>> but then calls page_set_owner(page, d).
>> 
>> There's some hairy code in __gnttab_map_grant_ref() too, but I _think_
>> it can't end up taking typecounts without refcounts.
>> 
>> __acquire_grant_for_copy() looks pretty hairy too, in particular this:
>>        (void)page_get_owner_and_reference(*page);
>> but presumably the matching put_page() would have crashed if that was
>> the problem.  Does anyone understand the grant code well enough to get
>> into that?
> 
> Problem is that the domain reported in the message is DOM_COW
> according to Olaf, yet he's not knowingly using page sharing. But
> that fact pretty much excluded the grant table or any other of the
> "usual" code paths for me (and I never looked at the sharing code,
> so hoped one of you two would).
> 
> Jan

Jan Beulich

2013-Feb-28 15:15 UTC

Re: error in xen/arch/x86/mm.c:get_page during migration

>>> On 28.02.13 at 15:59, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
> On Feb 28, 2013, at 9:06 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
>>>>> On 28.02.13 at 12:56, Tim Deegan <tim@xen.org> wrote:
>>> At 09:52 -0500 on 25 Feb (1361785966), Andres Lagar-Cavilla wrote:
>>>> There seems to be something else amiss though. Unless I am parsing
>>>> this incorrectly, taf == PGT_writable | PGT_pae_xen_l2? And caf == PAT
>>>> | PCD? Looks like a very unlikely combination
>>> 
>>> By my reading, 
>>> 
>>> taf = 0x7400000000000001 = typecount 1, PGT_writable_page | PGT_validated
>>> caf = 0x0180000000000000 = refcount 0, PGC_state_free
>> 
>> Right.
> D'oh :)
> 
> Sharing code never sets PGT_writable, only sets/clears PGT_shared. I tend to
> think the dom_cow rd belies a different problem. To use dom_cow, the domain
> has to have an explicit domctl that enables memory sharing.

In that case, Olaf, could you simply set dom_cow to some invalid
pointer right after it got set up through domain_create(), so that
any dereference of it would blow up? That might be the faster path
towards finding the first bogus use.

Jan
