This is a fairly simple test and it does work with Xen 4.2.

# xl info
host : tst035.dumpdata.com
release : 3.10.0-rc5upstream-00438-g335262d-dirty
version 3f00:17bae3ff:00000000:00000001:00000000
virt_caps : hvm hvm_directio
total_memory : 8016
free_memory : 5852
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 3
xen_extra : -unstable
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : Mon Jun 10 14:42:51 2013 +0200 git:44434f3-dirty
xen_commandline : com1=115200,8n1 tmem=1 dom0_mem=max:2G cpufreq=xen:performance,verbose noreboot console=com1,vga loglvl=all guest_loglvl=all
cc_compiler : gcc (GCC) 4.4.4 20100503 (Red Hat 4.4.4-2)
cc_compile_by : konrad
cc_compile_domain : (none)
cc_compile_date : Mon Jun 10 17:01:43 EDT 2013
xend_config_format : 4

# [ 133.475684] xen-blkback:(backend_changed:585) .
Jun 11 13:39:58 tst035 logger: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/1/51712
[ 133.477018] xen-blkback:(xen_vbd_create:421) Successful creation of handle=ca00 (dom=1)
[ 133.477018] .
[ 133.479632] xen-blkback:(frontend_changed:665) Initialising.
mapping kernel into physical memory
about to get started...
Jun 11 13:39:59 tst035 logger: /etc/xen/scripts/vif-bridge: online type_if=vif XENBUS_PATH=backend/vif/1/0
[ 133.635819] device vif1.0 entered promiscuous mode
[ 133.639363] IPv6: ADDRCONF(NETDEV_UP): vif1.0: link is not ready
Jun 11 13:39:59 tst035 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge online for vif1.0, bridge switch.
Jun 11 13:39:59 tst035 logger: /etc/xen/scripts/vif-bridge: Writing backend/vif/1/0/hotplug-status connected to xenstore.
[ 135.864732] IPv6: ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
[ 135.865760] switch: port 2(vif1.0) entered forward
[ 135.965777] xen-blkback:(frontend_changed:665) Initialised.
[ 135.966711] xen-blkback:(connect_ring:820) /local/domain/1/d persistent grants
[ 135.968942] xen-blkback:(connect:734) /local/domain/1/device/vbd/51712.
[ 135.981089] xen-blkback:(frontend_changed:665) Connected.
... snip..
[ 140.441073] xen-blkback: grant 38 added to the tree of persistent grants, using 28/1056
[ 140.441640] xen-blkback: grant 39 added to the tree of persistent grants, using 29/1056
[ 140.442284] xen-blkback: grant 40 added to the tree of persistent grants, using 30/1056
[ 140.442840] xen-blkback: grant 41 added to the tree of persistent grants, using 31/1056
[ 140.443389] xen-blkback: grant 42 added to the tree of persistent grants, using 32/1056
[ 140.443920] xen-blkback: grant 43 added to the tree of persistent grants, using 33/1056
[ 140.444449] xen-blkback: grant 44 added to the tree of persistent grants, using 34/1056
(XEN) tmem: initializing tmem capability for domid=1...<G><2>ok
(XEN) tmem: allocating persistent-private tmem pool for domid=1...<G><2>pool_id=0
[ 150.879132] switch: port 2(vif1.0) entered forwarding state
(XEN) Xen BUG at domain_page.c:143
(XEN) ----[ Xen-4.3-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82c4c0160461>] map_domain_page+0x450/0x514
(XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
(XEN) rax: 0000000000000020 rbx: ffff8300c68f9000 rcx: 0000000000000000
(XEN) rdx: 0000000000000020 rsi: 0000000000000020 rdi: 0000000000000000
(XEN) rbp: ffff82c4c02c7cc8 rsp: ffff82c4c02c7c88 r8: ffff820060001000
(XEN) r9: 00000000ffffffff r10: ffff820060006000 r11: 0000000000000000
(XEN) r12: ffff83022e1bb000 r13: 00000000001ebcdc r14: 0000000000000020
(XEN) r15: 0000000000000004 cr0: 0000000080050033 cr4: 00000000000426f0
(XEN) cr3: 0000000209541000 cr2: ffff88002b683fd0
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff82c4c02c7c88:
(XEN) ffff83022e1bb2d8 0000000000000286 ffff82c4c012760a ffff83022e1bb000
(XEN) ffff82e003d79b80 ffff82c4c02c7d60 00000000001ebcdc 0000000000000000
(XEN) ffff82c4c02c7d38 ffff82c4c01373de ffff82c4c0127b6b ffffffffffffffff
(XEN) 00000000c02c7d38 ffff82c4c02c7d58 ffff83022e1bb2d8 0000000000000286
(XEN) 0000000000000027 0000000000000000 0000000000001000 0000000000000000
(XEN) 0000000000000000 00000000001ebcdc ffff82c4c02c7d98 ffff82c4c01377c4
(XEN) 0000000000000000 ffff820040014000 ffff82e003d79b80 00000000001ebcdc
(XEN) ffff82c4c02c7d98 ffff830210ecf390 00000000fffffff4 ffff820040010010
(XEN) ffff82004001cf50 ffff83022e1bcc90 ffff82c4c02c7e18 ffff82c4c0135929
(XEN) ffff82c4c02c7db8 ffff82004001cf50 0000000000000000 00000000001ebcdc
(XEN) 0000000000000000 0000000000000000 0000e8a200000000 ffff82c4c02c7e00
(XEN) ffff82c4c02c7e18 ffff83022e1bcc90 ffff830210ecf390 0000000000000000
(XEN) 0000000000000001 000000000000009a ffff82c4c02c7ef8 ffff82c4c0136510
(XEN) 0000002700001000 0000000000000000 ffff82c4c02c7e90 97c4284effffffc2
(XEN) ffff82c4c02c7e68 ffff82c4c015719d ffff82c4c0127b09 0000000000000000
(XEN) ffff82c4c02c7e88 ffff82c4c018c13c ffff82c4c0319100 ffff82c4c02c7f18
(XEN) 0000000000000004 0000000000000001 0000000000000000 0000000000000000
(XEN) 000000000000e8a2 0000000000000000 00000000001ebcdc 000000000000e030
(XEN) 0000000000000246 ffff8300c68f9000 0000000000000000 0000000000000000
(XEN) 0000000000000001 0000000000000000 00007d3b3fd380c7 ffff82c4c02236db
(XEN) Xen call trace:
(XEN) [<ffff82c4c0160461>] map_domain_page+0x450/0x514
(XEN) [<ffff82c4c01373de>] cli_get_page+0x15e/0x17b
(XEN) [<ffff82c4c01377c4>] tmh_copy_from_client+0x150/0x284
(XEN) [<ffff82c4c0135929>] do_tmem_put+0x323/0x5c4
(XEN) [<ffff82c4c0136510>] do_tmem_op+0x5a0/0xbd0
(XEN) [<ffff82c4c02236db>] syscall_enter+0xeb/0x145
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at domain_page.c:143
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)
>>> On 11.06.13 at 15:45, konrad wilk <konrad.wilk@oracle.com> wrote:
> This is a fairly simple test and it does work with Xen 4.2.
> [...]
> (XEN) ----[ Xen-4.3-unstable x86_64 debug=y Not tainted ]----
> [...]
> (XEN) Xen call trace:
> (XEN) [<ffff82c4c0160461>] map_domain_page+0x450/0x514
> (XEN) [<ffff82c4c01373de>] cli_get_page+0x15e/0x17b
> (XEN) [<ffff82c4c01377c4>] tmh_copy_from_client+0x150/0x284
> (XEN) [<ffff82c4c0135929>] do_tmem_put+0x323/0x5c4
> (XEN) [<ffff82c4c0136510>] do_tmem_op+0x5a0/0xbd0
> (XEN) [<ffff82c4c02236db>] syscall_enter+0xeb/0x145
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Xen BUG at domain_page.c:143
> (XEN) ****************************************

For one, you won't see this with debug=n.

And then it seems quite likely that tmem shows behavior that I'm
unaware of, and hence I may have broken it with the 16Tb support
patches - I suspect it simply drives the hypervisor out of domain page
mapping resources. After all, x86-64 didn't do any such mapping yet in
4.2.

But tmem being unsupported due to the still pending security audit
makes this a low priority issue anyway. And as you may or may not
recall, it is being disabled for systems with more than 5Tb too.
So quite a bit of work on the tmem side...

Jan
On 6/11/2013 10:46 AM, Jan Beulich wrote:
>>>> On 11.06.13 at 15:45, konrad wilk <konrad.wilk@oracle.com> wrote:
>> This is a fairly simple test and it does work with Xen 4.2.
>> [...]
>> (XEN) Panic on CPU 0:
>> (XEN) Xen BUG at domain_page.c:143
>> (XEN) ****************************************
> For one, you won't see this with debug=n.
>
> And then it seems quite likely that tmem shows behavior that I'm
> unaware of, and hence I may have broken it with the 16Tb
> support patches - I suspect it simply drives the hypervisor out of
> domain page mapping resources. After all, x86-64 didn't do any
> such mapping yet in 4.2.

I think this is a more subtle bug.
I applied a debug patch (see attached) and with the help of it and the logs:

(XEN) domain_page.c:160:d1 mfn (1ebe96) -> 6 idx: 32(i:1,j:0), branch:1
(XEN) domain_page.c:166:d1 [0] idx=26, mfn=0x1ebcd8, refcnt: 0
(XEN) domain_page.c:166:d1 [1] idx=12, mfn=0x1ebcd9, refcnt: 0
(XEN) domain_page.c:166:d1 [2] idx=2, mfn=0x210e9a, refcnt: 0
(XEN) domain_page.c:166:d1 [3] idx=14, mfn=0x210e9b, refcnt: 0
(XEN) domain_page.c:166:d1 [4] idx=7, mfn=0x210e9c, refcnt: 0
(XEN) domain_page.c:166:d1 [5] idx=10, mfn=0x210e9d, refcnt: 0
(XEN) domain_page.c:166:d1 [6] idx=5, mfn=0x210e9e, refcnt: 0
(XEN) domain_page.c:166:d1 [7] idx=13, mfn=0x1ebe97, refcnt: 0
(XEN) Xen BUG at domain_page.c:169
(XEN) ----[ Xen-4.3-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 3
(XEN) RIP: e008:[<ffff82c4c01606a7>] map_domain_page+0x61d/0x6e1
(XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: ffff8300c68f9000 rcx: 0000000000000000
(XEN) rdx: ffff8302125b2020 rsi: 000000000000000a rdi: ffff82c4c027a6e8
(XEN) rbp: ffff8302125afcc8 rsp: ffff8302125afc48 r8: 0000000000000004
(XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001
(XEN) r12: ffff83022e2ef000 r13: 00000000001ebe96 r14: 0000000000000020
(XEN) r15: ffff8300c68f9080 cr0: 0000000080050033 cr4: 00000000000426f0
(XEN) cr3: 0000000209541000 cr2: ffffffffff600400
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff8302125afc48:
(XEN) 00000000001ebe97 0000000000000000 0000000000000000 ffff830200000001
(XEN) ffff8302125afcc8 ffff82c400000000 00000000001ebe97 000000080000000d
(XEN) ffff83022e2ef2d8 0000000000000286 ffff82c4c0127b6b ffff83022e2ef000
(XEN) ffff82e003d7d2c0 ffff8302125afd60 00000000001ebe96 0000000000000000
(XEN) ffff8302125afd38 ffff82c4c01373de 0000000000000000 ffffffffffffffff
(XEN) 0000000000000001 ffff8302125afd58 ffff83022e2ef2d8 0000000000000286
(XEN) 0000000000000027 0000000000000000 0000000000001000 0000000000000000
(XEN) 0000000000000000 00000000001ebe96 ffff8302125afd98 ffff82c4c01377c4
(XEN) 0000000000000000 ffff820040017000 ffff82e003d7d2c0 00000000001ebe96
(XEN) ffff8302125afd98 ffff830210ecf390 00000000fffffff4 ffff820040009010
(XEN) ffff820040000f50 ffff83022e2f0c90 ffff8302125afe18 ffff82c4c0135929
(XEN) 000000160000001e ffff820040000f50 0000000000000000 00000000001ebe96
(XEN) 0000000000000000 0000000000000000 0000a2f6125afe28 ffff8302125afe00
(XEN) 0000001675f02b51 ffff83022e2f0c90 ffff830210ecf390 0000000000000000
(XEN) 0000000000000001 0000000000000065 ffff8302125afef8 ffff82c4c0136510
(XEN) ffff830200001000 0000000000000000 ffff8302125afe90 255ece02125b2040
(XEN) 00000003125afe68 00000016742667d1 ffff8302125b2100 0000003d52299000
(XEN) ffff8300c68f9000 0000000001c9c380 ffff8302125b2100 ffff8302125b1808
(XEN) 0000000000000004 0000000000000004 0000000000000000 0000000000000000
(XEN) 000000000000a2f6 0000000000000000 00000000001ebe96 ffff82c4c0126e77
(XEN) Xen call trace:
(XEN) [<ffff82c4c01606a7>] map_domain_page+0x61d/0x6e1
(XEN) [<ffff82c4c01373de>] cli_get_page+0x15e/0x17b
(XEN) [<ffff82c4c01377c4>] tmh_copy_from_client+0x150/0x284
(XEN) [<ffff82c4c0135929>] do_tmem_put+0x323/0x5c4
(XEN) [<ffff82c4c0136510>] do_tmem_op+0x5a0/0xbd0
(XEN) [<ffff82c4c022391b>] syscall_enter+0xeb/0x145
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 3:
(XEN) Xen BUG at domain_page.c:169
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)

It looks as if the path that is taken is:
110          idx = find_next_zero_bit(dcache->inuse, dcache->entries, dcache->cursor);
111          if ( unlikely(idx >= dcache->entries) )
112          {

115              /* /First/, clean the garbage map and update the inuse list. */
116              for ( i = 0; i < BITS_TO_LONGS(dcache->entries); i++ )
117              {
118                  dcache->inuse[i] &= ~xchg(&dcache->garbage[i], 0);
119                  accum |= ~dcache->inuse[i];

Here computes the accum

120              }
121
122              if ( accum )
123                  idx = find_first_zero_bit(dcache->inuse, dcache->entries);

Ok, finds the idx (32),

124              else
125              {

.. does not go here.

142              }
143          BUG_ON(idx >= dcache->entries);

And hits the BUG_ON().

But I am not sure if that is appropriate. Perhaps the BUG_ON was meant as a
check for the loop (lines 128 -> 141) - in case it looped around and never
found an empty place. But if that is the condition then that would also look
suspect as it might have found an empty hash entry and the idx would still
end up being 32.

> But tmem being unsupported due to the still pending security
> audit makes this a low priority issue anyway. And as you may

I think the discussion on whether a fix should go in Xen 4.3 (or the stable
releases) is something we can discuss when a patch has been fabricated.

> or may not recall, it is being disabled for systems with more
> than 5Tb too. So quite a bit of work on the tmem side...
>
> Jan
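To make the bookkeeping walked through above easier to follow, here is a small
standalone model of the two-bitmap scheme (plain C written for this write-up,
not Xen code; names and slot counts are simplified): unmapping only marks a
slot as garbage, and the reclaim pass at the start of the quoted snippet is
what actually frees slots, so a workload that always unmaps promptly never
runs dry.

/* Standalone model of the inuse/garbage bookkeeping -- not Xen code.
 * unmap only records the slot in "garbage"; the next map that finds no
 * free slot folds the garbage back into "inuse" (in Xen this is also the
 * point where the stale TLB entries get flushed). */
#include <stdio.h>

#define ENTRIES 32              /* matches dcache->entries in this thread */

static unsigned long inuse, garbage;

static int map_slot(void)
{
    unsigned int idx;

    for ( idx = 0; idx < ENTRIES && (inuse & (1UL << idx)); idx++ )
        ;
    if ( idx == ENTRIES )       /* no free slot: reclaim deferred unmaps */
    {
        inuse &= ~garbage;
        garbage = 0;
        for ( idx = 0; idx < ENTRIES && (inuse & (1UL << idx)); idx++ )
            ;
    }
    if ( idx == ENTRIES )
        return -1;              /* genuinely full -- tmem's situation below */
    inuse |= 1UL << idx;
    return idx;
}

static void unmap_slot(int idx)
{
    garbage |= 1UL << idx;      /* deferred release; the inuse bit stays set */
}

int main(void)
{
    int i, idx;

    for ( i = 0; i < 100; i++ )
    {
        if ( (idx = map_slot()) < 0 )
        {
            printf("out of slots after %d maps\n", i);
            return 1;
        }
        unmap_slot(idx);        /* short-lived use: everything gets unmapped */
    }
    printf("%d short-lived maps fit in %d slots\n", i, ENTRIES);
    return 0;
}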
On Tue, Jun 11, 2013 at 4:30 PM, konrad wilk <konrad.wilk@oracle.com> wrote:
> I think this is a more subtle bug.
> I applied a debug patch (see attached) and with the help of it and the logs:
>
> (XEN) domain_page.c:160:d1 mfn (1ebe96) -> 6 idx: 32(i:1,j:0), branch:1
> [...]
> (XEN) Xen BUG at domain_page.c:169
> [...]
> 122              if ( accum )
> 123                  idx = find_first_zero_bit(dcache->inuse, dcache->entries);
>
> Ok, finds the idx (32),
> [...]
> 143          BUG_ON(idx >= dcache->entries);
>
> And hits the BUG_ON().
>
> But I am not sure if that is appropriate. Perhaps the BUG_ON was meant as a
> check for the loop (lines 128 -> 141) - in case it looped around and never
> found an empty place. But if that is the condition then that would also look
> suspect as it might have found an empty hash entry and the idx would still
> end up being 32.

Right -- it is really curious that "accum |= ~dcache->inuse[x]" managed
to be non-zero, while find_first_zero_bit() goes off the end (as it seems).

It seems like you should add a printk in the first loop:

    if(~dcache->inuse[i]) printk(...);

Also, I don't think you've printed what dcache->entries is -- is it 32?

 -George
>>> On 11.06.13 at 17:30, konrad wilk <konrad.wilk@oracle.com> wrote:
> I think this is a more subtle bug.
> I applied a debug patch (see attached) and with the help of it and the logs:
>
> (XEN) domain_page.c:160:d1 mfn (1ebe96) -> 6 idx: 32(i:1,j:0), branch:1
> [...]
> (XEN) Xen BUG at domain_page.c:169
> [...]
> And hits the BUG_ON().
>
> But I am not sure if that is appropriate. Perhaps the BUG_ON was meant
> as a check for the loop (lines 128 -> 141) - in case it looped around and
> never found an empty place. But if that is the condition then that would
> also look suspect as it might have found an empty hash entry and the idx
> would still end up being 32.

The BUG_ON() here is definitely valid - a few lines down, after the
enclosing if(), we use it in ways that requires this to not have
triggered. It basically tells you whether an in range idx was found,
which apparently isn't the case here.

As I think George already pointed out - printing accum here would
be quite useful: It should have at least one of the low 32 bits set,
given that dcache->entries must be at most 32 according to the
data you already got logged.

Jan
On 6/11/2013 12:38 PM, Jan Beulich wrote:
>>>> On 11.06.13 at 17:30, konrad wilk <konrad.wilk@oracle.com> wrote:
>> I think this is a more subtle bug.
>> I applied a debug patch (see attached) and with the help of it and the logs:
>> [...]
>> But I am not sure if that is appropriate. Perhaps the BUG_ON was meant
>> as a check for the loop (lines 128 -> 141) - in case it looped around and
>> never found an empty place. But if that is the condition then that would
>> also look suspect as it might have found an empty hash entry and the idx
>> would still end up being 32.
> The BUG_ON() here is definitely valid - a few lines down, after the
> enclosing if(), we use it in ways that requires this to not have
> triggered. It basically tells you whether an in range idx was found,
> which apparently isn't the case here.
>
> As I think George already pointed out - printing accum here would
> be quite useful: It should have at least one of the low 32 bits set,
> given that dcache->entries must be at most 32 according to the
> data you already got logged.

Of course, here is the new log (and the debug attachment)

(XEN) domain_page.c:122:d1 [0]: ffffffff, idx: 32
(XEN) domain_page.c:167:d1 mfn (1eba98) -> 0 idx: 32(i:1,j:0), branch:9 0xffffffff00000000
(XEN) domain_page.c:173:d1 [0] idx=0, mfn=0x182790, refcnt: 0
(XEN) domain_page.c:173:d1 [1] idx=29, mfn=0x1946f9, refcnt: 0
(XEN) domain_page.c:173:d1 [2] idx=15, mfn=0x1946fa, refcnt: 0
(XEN) domain_page.c:173:d1 [3] idx=11, mfn=0x1946fb, refcnt: 0
(XEN) domain_page.c:173:d1 [4] idx=17, mfn=0x1946fc, refcnt: 0
(XEN) domain_page.c:173:d1 [5] idx=21, mfn=0x1946fd, refcnt: 0
(XEN) domain_page.c:173:d1 [6] idx=10, mfn=0x180296, refcnt: 0
(XEN) domain_page.c:173:d1 [7] idx=4, mfn=0x180297, refcnt: 0
(XEN) Xen BUG at domain_page.c:176
(XEN) ----[ Xen-4.3-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 3
(XEN) RIP: e008:[<ffff82c4c0160742>] map_domain_page+0x6b8/0x77c
(XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: ffff8300c68f9000 rcx: 0000000000000000
(XEN) rdx: ffff83020e84c020 rsi: 000000000000000a rdi: ffff82c4c027a6e8
(XEN) rbp: ffff83020e847cc8 rsp: ffff83020e847c28 r8: 0000000000000004
(XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001
(XEN) r12: ffff83022d815000 r13: 00000000001eba98 r14: 0000000000000020
(XEN) r15: ffff8300c68f9080 cr0: 0000000080050033 cr4: 00000000000426f0
(XEN) cr3: 000000019c644000 cr2: ffff88000ef124b0
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff83020e847c28:
(XEN) 0000000000180297 0000000000000000 ffff830200000000 0000000000000009
(XEN) ffffffff00000000 ffff82c4c0116542 0000000000000297 ffffffff00000000
(XEN) ffff830200000020 00000000c01714f6 0000000000180297 0000000800000004
(XEN) ffff83022d8152d8 0000000000000286 ffff82c4c012760a ffff83022d815000
(XEN) ffff82e003d75300 ffff83020e847d60 00000000001eba98 0000000000000000
(XEN) ffff83020e847d38 ffff82c4c01373de 0000000000000000 ffffffffffffffff
(XEN) 0000000000000001 ffff83020e847d58 ffff83022d8152d8 0000000000000286
(XEN) 0000000000000027 0000000000000000 0000000000001000 0000000000000000
(XEN) 0000000000000000 00000000001eba98 ffff83020e847d98 ffff82c4c01377c4
(XEN) 0000000000000000 ffff82004001a000 ffff82e003d75300 00000000001eba98
(XEN) ffff83020e847d98 ffff83020354f390 00000000fffffff4 ffff820040002010
(XEN) ffff820040001580 ffff83022d816c90 ffff83020e847e18 ffff82c4c0135929
(XEN) ffff83020e847db8 ffff820040001580 0000000000000000 00000000001eba98
(XEN) 0000000000000000 0000000000000000 000001f200000000 ffff83020e847e00
(XEN) ffff83020e847e18 ffff83022d816c90 ffff83020354f390 0000000000000000
(XEN) 0000000000000001 0000000000000091 ffff83020e847ef8 ffff82c4c0136510
(XEN) ffff830200001000 0000000000000000 ffff83020e847e90 bbbc0ca3c027bba0
(XEN) ffff82c4c027bba0 ffff82c4c02e0000 0000000000000002 ffff83020e847e78
(XEN) ffff82c4c0127b09 ffff82c4c027bba0 ffff83020e847e98 ffff82c4c01299af
(XEN) 0000000000000004 0000000000000005 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN) [<ffff82c4c0160742>] map_domain_page+0x6b8/0x77c
(XEN) [<ffff82c4c01373de>] cli_get_page+0x15e/0x17b
(XEN) [<ffff82c4c01377c4>] tmh_copy_from_client+0x150/0x284
(XEN) [<ffff82c4c0135929>] do_tmem_put+0x323/0x5c4
(XEN) [<ffff82c4c0136510>] do_tmem_op+0x5a0/0xbd0
(XEN) [<ffff82c4c02239bb>] syscall_enter+0xeb/0x145
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 3:
(XEN) Xen BUG at domain_page.c:176
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)
> The BUG_ON() here is definitely valid - a few lines down, after the
> enclosing if(), we use it in ways that requires this to not have
> triggered. It basically tells you whether an in range idx was found,
> which apparently isn't the case here.
>
> As I think George already pointed out - printing accum here would
> be quite useful: It should have at least one of the low 32 bits set,
> given that dcache->entries must be at most 32 according to the
> data you already got logged.

With extra debugging (see attached patch)

(XEN) domain_page.c:125:d1 mfn: 1eb483, [0]: bffff1ff, ~ffffffff40000e00, idx: 9 garbage: 40000e00, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 1eb480, [0]: fdbfffff, ~ffffffff02400000, idx: 22 garbage: 2400000, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 2067ca, [0]: fffff7ff, ~ffffffff00000800, idx: 11 garbage: 800, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 183642, [0]: ffffffff, ~ffffffff00000000, idx: 32 garbage: 0, inuse: ffffffff
(XEN) domain_page.c:170:d1 mfn (183642) -> 2 idx: 32(i:1,j:0), branch:9 0xffffffff00000000
(XEN) domain_page.c:176:d1 [0] idx=13, mfn=0x203b00, refcnt: 0
(XEN) domain_page.c:176:d1 [1] idx=25, mfn=0x1839e1, refcnt: 0
(XEN) domain_page.c:176:d1 [2] idx=3, mfn=0x1824d2, refcnt: 0
(XEN) domain_page.c:176:d1 [3] idx=5, mfn=0x1eb48b, refcnt: 0
(XEN) domain_page.c:176:d1 [4] idx=28, mfn=0x203b04, refcnt: 0
(XEN) domain_page.c:176:d1 [5] idx=0, mfn=0x1eb485, refcnt: 0
(XEN) domain_page.c:176:d1 [6] idx=30, mfn=0x203afe, refcnt: 0
(XEN) domain_page.c:176:d1 [7] idx=20, mfn=0x203aff, refcnt: 0

And that does paint the picture that we have exhausted the full 32
entries of the mapcache.

Now off to find out who is holding them and why. Aren't these
operations (map/unmap domain_page) supposed to be short-lived?

> Jan
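For reference, the usage model these mappings are built around is strictly
map, use, unmap within a single operation. A minimal sketch in the Xen idiom:
map_domain_page()/unmap_domain_page() are the real interfaces, while the
caller copy_from_guest_frame() and its arguments are hypothetical, invented
here purely for illustration.

/* Sketch of the intended short-lived usage; "copy_from_guest_frame" is a
 * hypothetical illustrative caller, not a function in the Xen tree. */
static void copy_from_guest_frame(unsigned long mfn, void *dst)
{
    void *va = map_domain_page(mfn);   /* claims one of the 32 slots */

    memcpy(dst, va, PAGE_SIZE);
    unmap_domain_page(va);             /* slot becomes reclaimable again */
}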
On 6/11/2013 2:52 PM, konrad wilk wrote:
> With extra debugging (see attached patch)
> [...]
> And that does paint the picture that we have exhausted the full 32
> entries of the mapcache.
>
> Now off to find out who is holding them and why. Aren't these
> operations (map/unmap domain_page) supposed to be short-lived?

And found the culprit. With some EIP logging:

(XEN) domain_page.c:214:d1 [0] mfn=0x1ff67a idx=0, mfn=0x1ff67a, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [1] mfn=18fef2, [EIP=0]
(XEN) domain_page.c:216:d1 [2] mfn=1eb518, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [3] mfn=170a08, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [4] mfn=18feef, [EIP=0]
(XEN) domain_page.c:216:d1 [5] mfn=1eb4c8, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [6] mfn=202699, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [7] mfn=18fef0, [EIP=0]
(XEN) domain_page.c:216:d1 [8] mfn=0, [EIP=0]
(XEN) domain_page.c:214:d1 [9] mfn=0x18e7ed idx=9, mfn=0x18e7ed, refcnt: 0 [EIP=0]
(XEN) domain_page.c:214:d1 [10] mfn=0x18f629 idx=10, mfn=0x18f629, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [11] mfn=1eb47e, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:214:d1 [12] mfn=0x18209c idx=12, mfn=0x18209c, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [13] mfn=18fef5, [EIP=0]
(XEN) domain_page.c:214:d1 [14] mfn=0x18f62b idx=14, mfn=0x18f62b, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [15] mfn=1eb459, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [16] mfn=1eb512, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [17] mfn=170d2b, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [18] mfn=20272b, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [19] mfn=16c22c, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [20] mfn=18fef4, [EIP=0]
(XEN) domain_page.c:216:d1 [21] mfn=18e7e9, [EIP=0]
(XEN) domain_page.c:216:d1 [22] mfn=18feee, [EIP=0]
(XEN) domain_page.c:216:d1 [23] mfn=1eb4a3, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [24] mfn=18fef3, [EIP=0]
(XEN) domain_page.c:214:d1 [25] mfn=0x18f62f idx=25, mfn=0x18f62f, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [26] mfn=18ff02, [__get_page_type+0x1001/0x146a]
(XEN) domain_page.c:214:d1 [27] mfn=0x18fefe idx=27, mfn=0x18fefe, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [28] mfn=18ff00, [__get_page_type+0xcc3/0x146a]
(XEN) domain_page.c:216:d1 [29] mfn=0, [EIP=0]
(XEN) domain_page.c:214:d1 [30] mfn=0x18f628 idx=30, mfn=0x18f628, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [31] mfn=1eb4ed, [tmh_persistent_pool_page_get+0x26d/0x2d8]

And from a brief look at the code it looks as if any call into the xmalloc
pool code ends up calling map_domain_page(). Since most of the tmem code is
using the pool to store guest pages (looking briefly at tmem_malloc), this
would explain why we ran out of the 32 slots - especially as we don't free
them until the guest puts the persistent pages back.

The fix.. well, not here yet, but I think it would mostly concentrate on the
tmem code.

Thanks for the suggestion on looking at the accum value.
>>> On 11.06.13 at 23:06, konrad wilk <konrad.wilk@oracle.com> wrote:
> On 6/11/2013 2:52 PM, konrad wilk wrote:
>> And that does paint the picture that we have exhausted the full 32
>> entries of the mapcache.
>>
>> Now off to find out who is holding them and why. Aren't these
>> operations (map/unmap domain_page) supposed to be short-lived?

Yes, they are.

> And found the culprit. With some EIP logging:
> [...]
> (XEN) domain_page.c:216:d1 [31] mfn=1eb4ed, [tmh_persistent_pool_page_get+0x26d/0x2d8]
>
> And from a brief look at the code it looks as if any call into the xmalloc
> pool code ends up calling map_domain_page(). Since most of the tmem code is
> using the pool to store guest pages (looking briefly at tmem_malloc), this
> would explain why we ran out of the 32 slots - especially as we don't free
> them until the guest puts the persistent pages back.

Yes, this is not (and never was) a valid use model for map_domain_page().

What's really odd is the respective difference between
tmh_mempool_page_get() (using page_to_virt() on the result of
tmh_alloc_page()) and tmh_persistent_pool_page_get() (using
__map_domain_page() on what _tmh_alloc_page_thispool() returned), while
both allocation functions end up calling alloc_domheap_page(). With Dan
no longer around, it may be hard to understand the reasons behind this
brokenness.

As tmem gets disabled anyway when there is memory not covered by the
1:1 mapping, switching tmh_persistent_pool_page_get() to use
page_to_virt() would appear to be the obvious immediate solution.
Re-enabling it on such huge memory systems is going to require a
re-design anyway afaict.

Jan
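Based purely on Jan's description above, the immediate change would look
roughly like the sketch below. The function body is reconstructed from this
thread for illustration and is not copied from the tree; only the interfaces
named by Jan (page_to_virt(), __map_domain_page(), _tmh_alloc_page_thispool())
are taken as given.

/* Sketch only: reconstructed from Jan's description of
 * tmh_persistent_pool_page_get(), with the one-line change he proposes. */
static void *tmh_persistent_pool_page_get(unsigned long size)
{
    struct page_info *pi = _tmh_alloc_page_thispool(current->domain);

    if ( pi == NULL )
        return NULL;

    /* Was: return __map_domain_page(pi); -- that pins a mapcache slot for
     * as long as the pool keeps the page, which is what exhausts the cache.
     * Using the 1:1 direct mapping instead is safe while tmem stays disabled
     * on hosts whose memory is not fully covered by that mapping. */
    return page_to_virt(pi);
}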
create ^
title it map_domain_page second-stage emergency fallback path never taken
thanks

On Tue, Jun 11, 2013 at 7:52 PM, konrad wilk <konrad.wilk@oracle.com> wrote:
>> The BUG_ON() here is definitely valid - a few lines down, after the
>> enclosing if(), we use it in ways that requires this to not have
>> triggered. It basically tells you whether an in range idx was found,
>> which apparently isn't the case here.
>>
>> As I think George already pointed out - printing accum here would
>> be quite useful: It should have at least one of the low 32 bits set,
>> given that dcache->entries must be at most 32 according to the
>> data you already got logged.
>
> With extra debugging (see attached patch)
>
> (XEN) domain_page.c:125:d1 mfn: 1eb483, [0]: bffff1ff, ~ffffffff40000e00, idx: 9 garbage: 40000e00, inuse: ffffffff
> (XEN) domain_page.c:125:d1 mfn: 1eb480, [0]: fdbfffff, ~ffffffff02400000, idx: 22 garbage: 2400000, inuse: ffffffff
> (XEN) domain_page.c:125:d1 mfn: 2067ca, [0]: fffff7ff, ~ffffffff00000800, idx: 11 garbage: 800, inuse: ffffffff
> (XEN) domain_page.c:125:d1 mfn: 183642, [0]: ffffffff, ~ffffffff00000000, idx: 32 garbage: 0, inuse: ffffffff

So regardless of the fact that tmem is obviously holding what are
supposed to be short-term references for so long, there is something
that seems not quite right about this failure path.

It looks like the algorithm is:
1. Clean the garbage map and update the inuse list
2. If anything has been cleaned up, use the first not-inuse entry
3. Otherwise, do something else ("replace a hash entry" -- not sure
   exactly what that means).

What we see above is that this failure path succeeds three times, but
fails the fourth time: there are, in fact, no zero entries after the
garbage clean-up; however, because "inuse" is 32-bit (effectively) and
"accum" is 64-bit, ~inuse always has bits 32-63 set, and so will
always return true and never fall back to the "something else" path.

This is probably not something we need to fix for 4.3, but we should
put it on our to-do list.

 -George
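A standalone model of the mismatch George describes (plain C written for this
write-up, not Xen code) reproduces the failure directly: with 32 entries
tracked in a 64-bit word, the "anything reclaimed?" test can never come out
false, so step 3 above is unreachable.

/* Standalone model -- not Xen code -- of the failure described above:
 * 32 mapcache entries tracked in one 64-bit word mean ~inuse always has
 * bits 32-63 set, so "accum" is non-zero even with every entry occupied,
 * and the scan then runs off the end of the valid range. */
#include <stdio.h>

/* Simplified single-word stand-in for the real find_first_zero_bit(). */
static unsigned int first_zero_bit(unsigned long word, unsigned int nbits)
{
    unsigned int i;

    for ( i = 0; i < nbits && (word & (1UL << i)); i++ )
        ;
    return i;                           /* == nbits when it runs off the end */
}

int main(void)
{
    unsigned int entries = 32;          /* dcache->entries in this thread */
    unsigned long inuse = 0xffffffffUL; /* all 32 real slots occupied */
    unsigned long garbage = 0;          /* nothing was unmapped */
    unsigned long accum = 0;
    unsigned int idx;

    inuse &= ~garbage;                  /* step 1: clean the garbage map */
    accum |= ~inuse;                    /* bits 32-63 leak in: 0xffffffff00000000 */

    if ( accum )                        /* always true, so step 3 is dead code */
    {
        idx = first_zero_bit(inuse, entries);
        printf("accum=%#lx, idx=%u, entries=%u -> BUG_ON(idx >= entries)\n",
               accum, idx, entries);
    }
    else
        printf("would fall back to replacing a hash entry\n");
    return 0;
}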
xen@bugs.xenproject.org
2013-Jun-12 11:15 UTC
Processed: Re: Xen 4.3 + tmem = Xen BUG at domain_page.c:143
Processing commands for xen@bugs.xenproject.org:

> create ^
Created new bug #14 rooted at `<51B7720B.10607@oracle.com>'
Title: `Re: [Xen-devel] Xen 4.3 + tmem = Xen BUG at domain_page.c:143'

> title it map_domain_page second-stage emergency fallback path never taken
Set title for #14 to `map_domain_page second-stage emergency fallback path never taken'

> thanks
Finished processing.

Modified/created Bugs:
 - 14: http://bugs.xenproject.org/xen/bug/14 (new)

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
On Wed, Jun 12, 2013 at 12:00 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> This is probably not something we need to fix for 4.3, but we should
> put it on our to-do list.

Konrad,

Could you try the attached patch with your debug patch, to see if it
successfully falls back to the "replace a map hash" path?

 -George
>>> On 12.06.13 at 13:00, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> create ^
> title it map_domain_page second-stage emergency fallback path never taken
> thanks
>
> On Tue, Jun 11, 2013 at 7:52 PM, konrad wilk <konrad.wilk@oracle.com> wrote:
>> [...]
>> (XEN) domain_page.c:125:d1 mfn: 183642, [0]: ffffffff, ~ffffffff00000000, idx: 32 garbage: 0, inuse: ffffffff
>
> So regardless of the fact that tmem is obviously holding what are
> supposed to be short-term references for so long, there is something
> that seems not quite right about this failure path.
>
> It looks like the algorithm is:
> 1. Clean the garbage map and update the inuse list
> 2. If anything has been cleaned up, use the first not-inuse entry
> 3. Otherwise, do something else ("replace a hash entry" -- not sure
>    exactly what that means).
>
> What we see above is that this failure path succeeds three times, but
> fails the fourth time: there are, in fact, no zero entries after the
> garbage clean-up; however, because "inuse" is 32-bit (effectively) and
> "accum" is 64-bit, ~inuse always has bits 32-63 set, and so will
> always return true and never fall back to the "something else" path.

Right, that's what occurred to me too yesterday, but then again I knew
I had seen this code path executed. Now that I look again, I think I
understand why: All of my Dom0-s and typical DomU-s have a vCPU count
divisible by 4, and with MAPCACHE_VCPU_ENTRIES being 16, the full
unsigned long would always be used.

> This is probably not something we need to fix for 4.3, but we should
> put it on our to-do list.

Actually I think we should fix this right away.

Jan
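Assuming, as Jan's remark implies, that the per-domain cache holds
MAPCACHE_VCPU_ENTRIES (16) slots per vCPU and tracks them in 64-bit words,
the number of "phantom" free bits in the last word depends only on the vCPU
count, which is why a 2-vCPU guest (32 entries, as in Konrad's logs) hits the
problem while a 4-vCPU one does not. A quick illustration:

/* Quick illustration of Jan's point; the entries-per-vCPU relation is
 * taken from his remark above, the rest is plain arithmetic. */
#include <stdio.h>

int main(void)
{
    const unsigned int MAPCACHE_VCPU_ENTRIES = 16, BITS_PER_LONG = 64;
    unsigned int vcpus;

    for ( vcpus = 1; vcpus <= 4; vcpus++ )
    {
        unsigned int entries = vcpus * MAPCACHE_VCPU_ENTRIES;
        unsigned int phantom = (BITS_PER_LONG - entries % BITS_PER_LONG)
                               % BITS_PER_LONG;

        printf("%u vCPU(s): %2u entries, %2u phantom bits in the last word\n",
               vcpus, entries, phantom);
    }
    return 0;
}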
>>> On 12.06.13 at 13:37, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jun 12, 2013 at 12:00 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
>> This is probably not something we need to fix for 4.3, but we should
>> put it on our to-do list.
>
> Could you try the attached patch with your debug patch, to see if it
> successfully falls back to the "replace a map hash" path?

This could be a fix, but I'd prefer to do this with a change to an
entirely cold code path, i.e. by setting the unused bits in the last
word in mapcache_vcpu_init().

Jan
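A sketch of the idea Jan prefers: permanently mark the tail bits of the last
inuse word at vCPU-init time, so ~inuse can never advertise slots beyond
dcache->entries and the hot-path logic stays unchanged. The helper name and
the struct name are assumptions made for this sketch; dcache->inuse and
dcache->entries are the fields visible in the code quoted earlier.

/* Sketch of the cold-path fix Jan prefers -- not the actual patch.
 * mark_unused_tail_bits() is a hypothetical helper to be called from
 * mapcache_vcpu_init() once dcache->entries has been (re)computed. */
static void mark_unused_tail_bits(struct mapcache_domain *dcache)
{
    unsigned int i;

    /* Set every bit from dcache->entries up to the next word boundary,
     * so the cleanup loop's "accum |= ~inuse[i]" never sees them as free. */
    for ( i = dcache->entries; i & (BITS_PER_LONG - 1); i++ )
        __set_bit(i, dcache->inuse);
}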
On 12/06/13 13:12, Jan Beulich wrote:
>>>> On 12.06.13 at 13:00, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> [...]
>> What we see above is that this failure path succeeds three times, but
>> fails the fourth time: there are, in fact, no zero entries after the
>> garbage clean-up; however, because "inuse" is 32-bit (effectively) and
>> "accum" is 64-bit, ~inuse always has bits 32-63 set, and so will
>> always return true and never fall back to the "something else" path.
> Right, that's what occurred to me too yesterday, but then again I knew
> I had seen this code path executed. Now that I look again, I think I
> understand why: All of my Dom0-s and typical DomU-s have a vCPU count
> divisible by 4, and with MAPCACHE_VCPU_ENTRIES being 16, the full
> unsigned long would always be used.
>
>> This is probably not something we need to fix for 4.3, but we should
>> put it on our to-do list.
> Actually I think we should fix this right away.

How often is the second path taken in practice?

And, you said this doesn't happen with debug=n builds -- why not exactly?

I'm trying to assess the actual risk of not fixing it, vs the risk of
fixing it.

 -George
>>> On 12.06.13 at 15:16, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 12/06/13 13:12, Jan Beulich wrote:
>> [...]
>> Actually I think we should fix this right away.
>
> How often is the second path taken in practice?

On non-debug builds, not at all except on systems with more than 5Tb
(as all of the map_domain_page() code). Once domain page mappings are
needed, this depends on the use pattern of the function. In any case
this is going to be way more frequent than the one time per-vCPU setup.

> And, you said this doesn't happen with debug=n builds -- why not exactly?

That was with the user (tmem) in mind: For <= 5Tb systems, as said
above, map_domain_page() has a short cut. And for >5Tb systems tmem
gets turned off. But any other users of the function could still run
into this on huge memory systems, and with this being one of the listed
new features of 4.3 I think we should fix it.

Jan
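To make that answer concrete, the non-debug fast path can be thought of as
the sketch below. Apart from mfn_to_virt(), the names are hypothetical
stand-ins and the real code is organised differently; the point is only that
on hosts whose RAM is entirely reachable through the 1:1 direct map (up to
roughly 5Tb) the mapcache, and therefore the BUG_ON discussed here, is never
exercised.

/* Conceptual sketch only; direct_map_covers() and mapcache_map() are
 * hypothetical stand-ins, not functions in the Xen tree. */
void *map_domain_page_sketch(unsigned long mfn)
{
    /* debug=n, <= 5Tb host: the frame is already reachable through the
     * hypervisor's 1:1 direct mapping, no mapcache slot is consumed. */
    if ( direct_map_covers(mfn) )
        return mfn_to_virt(mfn);

    /* > 5Tb host (or debug=y): fall back to the per-vCPU mapcache,
     * which is where the slot exhaustion and the BUG_ON live. */
    return mapcache_map(mfn);
}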
Konrad Rzeszutek Wilk
2013-Jun-12 14:13 UTC
Re: Xen 4.3 + tmem = Xen BUG at domain_page.c:143
On Wed, Jun 12, 2013 at 12:37:53PM +0100, George Dunlap wrote:
> On Wed, Jun 12, 2013 at 12:00 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
> > This is probably not something we need to fix for 4.3, but we should
> > put it on our to-do list.
>
> Konrad,
>
> Could you try the attached patch with your debug patch, to see if it
> successfully falls back to the "replace a map hash" path?

It does, but I believe it falls into the BUG_ON scenario where idx = 32 and
the BUG_ON gets hit. Please see the serial log (attached is the debug patch +
your patch):

(XEN) domain_page.c:137:d1 mfn: 1eb533, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 1eb864, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 1eb862, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 1eb531, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 1ebd14, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 18e8bb, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 2062e2, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 1eb525, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 20653e, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 1ed3a6, [0]: ff000000->ff000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff
(XEN) domain_page.c:137:d1 mfn: 1ed3a6, [0]: ff000000->7f000000, garbage: 80000000 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffff7fffffff
(XEN) domain_page.c:137:d1 mfn: 1eb525, [0]: ff000000->df000000, garbage: 20000000 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffdfffffff
(XEN) domain_page.c:137:d1 mfn: 1ed3a4, [0]: df000000->df000000, garbage: 0 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffffffffff

BusyBox v1.14.3 (2013-06-10 16:30:12 EDT) built-in shell (ash)
Enter 'help' for a list of built-in commands.
# (XEN) domain_page.c:137:d1 mfn: 18e8c8, [0]: ff000000->7f000000, garbage: 80000000 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffff7fffffff (XEN) domain_page.c:137:d1 mfn: 1eb531, [0]: ff000000->ef000000, garbage: 10000000 -> ~0 (mask: 0), idx: 0, ~garbage: ffffffffeb531) -> 1 idx: 32(i:1,j:8), branch:a accum: 0x0 (XEN) domain_page.c:222:d1 [0] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [1] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [2] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [3] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [4] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [5] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [6] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [7] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [8] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [9] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [10] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [11] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [12] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [13] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [14] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [15] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [16] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [17] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [18] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [19] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [20] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [21] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [22] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [23] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [24] mfn=1eb531, [EIP=0] (XEN) domain_page.c:222:d1 [25] mfn=2062e2, [EIP=0] (XEN) domain_page.c:222:d1 [26] mfn=18e8bb, [EIP=0] (XEN) domain_page.c:222:d1 [27] mfn=18e8c8, [EIP=0] (XEN) domain_page.c:222:d1 [28] mfn=0, [EIP=0] (XEN) domain_page.c:222:d1 [29] mfn=1ed3a5, [EIP=0] (XEN) domain_page.c:222:d1 [30] mfn=1ed3a6, [EIP=0] (XEN) domain_page.c:222:d1 [31] mfn=1ebc14, [EIP=0] (XEN) Xen BUG at domain_page.c:226 (XEN) ----[ Xen-4.3-unstable x86_64 debug=y Tainted: C ]---- (XEN) CPU: 0 (XEN) RIP: e008:[<ffff82c4c01609bf>] map_domain_page+0x920/0xa03 (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor (XEN) rax: 0000000000000000 rbx: ffff8300c68f8000 rcx: 0000000000000000 (XEN) rdx: ffff82c4c03191a0 rsi: 000000000000000a rdi: ffff82c4c027a6e8 (XEN) rbp: ffff82c4c02c7e58 rsp: ffff82c4c02c7d78 r8: 0000000000000004 (XEN) r9: 0000000000000004 r10: 0000000000000004 r11: 0000000000000001 (XEN) r12: ffff83022d7b7000 r13: 00000000001eb531 r14: 0000000000000020 (XEN) r15: 0000000000000020 cr0: 000000008005003b cr4: 00000000000426f0 (XEN) cr3: 000000020653e000 cr2: 00007f808ece33e0 (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008 (XEN) Xen stack trace from rsp=ffff82c4c02c7d78: (XEN) ffff82c4c02f0de0 0000000000000001 0000000000000008 000000000000000a (XEN) 0000000000000000 ffff830000000000 ffffffffefffffff 0000000000000000 (XEN) 0000000000000000 0000002000000000 00000000ff000000 00000000001eb531 (XEN) ffff82c4c02c7e20 0000000000000020 ffff82c4c02c7e18 ffff8300c68f83c8 (XEN) 0000000ac02c7f08 ffff82c400000000 ffff83022d7b72d8 0000000000000286 (XEN) ffff82c4c03191c0 0000000000000002 0000000900000000 ffff83022d7b7000 (XEN) ffff83022d7b7000 ffff8300c68f8000 0000000000000000 ffff88003f904000 (XEN) ffff82c4c02c7ee8 ffff82c4c017a242 ffffffff81039c19 00000000000426f0 (XEN) 80100001ebc14065 ffff82c4c02c0000 ffffffffffffffff ffff82c4c02c7ed0 (XEN) ffff82c4c0126e77 ffff82c4c02e0000 ffff82c4c02c00e7 00000000001eb531 (XEN) ffffffffffffffff ffff8300c68f8000 0000000000000000 ffff88003e29fe40 (XEN) ffff88003e29fee0 0000000000000080 ffff82c4c02c7ef8 ffff82c4c017a50f (XEN) 
00007d3b3fd380c7 ffff82c4c0223c4b ffffffff810011ca 000000000000000e (XEN) ffff88003e27f800 ffff88003f90c740 0000000000000001 ffff88003f911640 (XEN) ffff88003e29fe30 ffff88003f904000 0000000000000286 0000000000000000 (XEN) ffff880000000000 00003ffffffff000 000000000000000e ffffffff810011ca (XEN) 0000000000000000 80100001ebc14065 ffff88003f904000 0001010000000000 (XEN) ffffffff810011ca 000000000000e033 0000000000000286 ffff88003e29fdf8 (XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000 (XEN) 0000000000000000 0000000000000000 ffff8300c68f8000 0000000000000000 (XEN) Xen call trace: (XEN) [<ffff82c4c01609bf>] map_domain_page+0x920/0xa03 (XEN) [<ffff82c4c017a242>] __do_update_va_mapping+0xd6/0x316 (XEN) [<ffff82c4c017a50f>] do_update_va_mapping+0x1e/0x22 (XEN) [<ffff82c4c0223c4b>] syscall_enter+0xeb/0x145 (XEN) (XEN) (XEN) **************************************** (XEN) Panic on CPU 0: (XEN) Xen BUG at domain_page.c:226 (XEN) **************************************** (XEN) (XEN) Manual reset required ('noreboot' specified)
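The log above shows the boundary condition being discussed: with all 32 entries in use and accum ending up 0, the index that comes back equals the number of entries, which is exactly what a bounds check of the kind the BUG_ON in the trace represents then catches. The following is a small illustrative sketch with an assumed helper name, not the Xen implementation.

    /* Illustrative only.  Shows how scanning a fully-set in-use word yields
     * an index equal to the entry count, tripping a bounds check like the
     * BUG_ON reported in the trace above. */
    #include <stdio.h>

    static unsigned int first_clear_bit(unsigned long word, unsigned int limit)
    {
        unsigned int i;
        for ( i = 0; i < limit; i++ )
            if ( !(word & (1UL << i)) )
                break;
        return i;                 /* == limit when every bit is set */
    }

    int main(void)
    {
        unsigned int entries = 32;
        unsigned long inuse = 0xffffffffUL;   /* all 32 entries occupied */
        unsigned int idx = first_clear_bit(inuse, entries);

        printf("idx = %u, entries = %u\n", idx, entries);
        if ( idx >= entries )
            printf("out of range -> this is where a BUG_ON(idx >= entries) style check fires\n");
        return 0;
    }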
On 12/06/2013 13:12, "Jan Beulich" <JBeulich@suse.com> wrote:>> What we see above is that this failure path succeeds three times, but >> fails the fourth time: there are, in fact, no zero entries after the >> garbage clean-up; however, because "inuse" is 32-bit (effectively) and >> "accum" is 64-bit, ~inuse always has bits 32-63 set, and so will >> always return true and never fall back to the "something else" > > Right, that''s what occurred to me too yesterday, but the again > I knew I had seen this code path executed. Now that I look again, > I think I understand why: All of my Dom0-s and typical DomU-s > have a vCPU count divisible by 4, and with MAPCACHE_VCPU_ENTRIES > being 16, the full unsigned long would always be used.Why are we so tight on MAPCACHE_VCPU_ENTRIES? Why not say double that number and get rid of the accum and the ''replace a hash entry instead'' logic instead? We never used to have it, and it''s kind of extra complication and a bit gross. -- Keir
On 12/06/2013 16:11, "Keir Fraser" <keir.xen@gmail.com> wrote:> On 12/06/2013 13:12, "Jan Beulich" <JBeulich@suse.com> wrote: > >>> What we see above is that this failure path succeeds three times, but >>> fails the fourth time: there are, in fact, no zero entries after the >>> garbage clean-up; however, because "inuse" is 32-bit (effectively) and >>> "accum" is 64-bit, ~inuse always has bits 32-63 set, and so will >>> always return true and never fall back to the "something else" >> >> Right, that''s what occurred to me too yesterday, but the again >> I knew I had seen this code path executed. Now that I look again, >> I think I understand why: All of my Dom0-s and typical DomU-s >> have a vCPU count divisible by 4, and with MAPCACHE_VCPU_ENTRIES >> being 16, the full unsigned long would always be used. > > Why are we so tight on MAPCACHE_VCPU_ENTRIES? Why not say double that number > and get rid of the accum and the ''replace a hash entry instead'' logic > instead? We never used to have it, and it''s kind of extra complication and a > bit gross.Could even pull MAPHASH_ENTRIES into config.h and explicitly add it to MAPCACHE_VCPU_ENTRIES. That would be neat would it not? And I believe we have space in the mapcache''s perdomain slot to enlarge MAPCACHE_VCPU_ENTRIES in this way (i.e., from 16 to 24). -- Keir> -- Keir > >
>>> On 12.06.13 at 17:11, Keir Fraser <keir.xen@gmail.com> wrote:
> On 12/06/2013 13:12, "Jan Beulich" <JBeulich@suse.com> wrote:
> [...]
>
> Why are we so tight on MAPCACHE_VCPU_ENTRIES? Why not say double that number
> and get rid of the accum and the 'replace a hash entry instead' logic
> instead? We never used to have it, and it's kind of extra complication and a
> bit gross.

First of all, doubling the entries is not an argument for dropping that code -
the old 32-bit implementation really would have needed this too from a
theoretical perspective: the number of readily available (garbage) entries is
bounded by MAPCACHE_VCPU_ENTRIES - MAPHASH_ENTRIES (because the hash entries
actively block getting treated as garbage).

Adding MAPHASH_ENTRIES into the calculation of MAPCACHE_VCPU_ENTRIES would
result in header dependency problems (I tried this when I re-activated that
code, as ideally we would want to set this value to what we anticipate we
might need _plus_ MAPHASH_ENTRIES). Consequently the code should be prepared
to recover entries from the hash anyway (this is particularly relevant when
MAPCACHE_VCPU_ENTRIES <= MAPHASH_ENTRIES - while that's not the case
currently, if someone decided to double the latter it would be).

Finally, the VA range for this already is an order-17 block (with the inuse
and garbage maps added at the end, i.e. it's slightly above 512M), so doubling
would require the code to be adjusted to handle a single per-domain block
covering two per-domain slots (each slot covering 1Gb).

Jan
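A back-of-the-envelope check of the figures quoted above. This is a sketch with assumed values: MAX_VIRT_CPUS = 8192 and MAPHASH_ENTRIES = 8 are not stated in the thread, they are only chosen to be consistent with the order-17 / "slightly above 512M" and "16 to 24" numbers mentioned here.

    /* Illustrative arithmetic only -- constants are assumptions, not the
     * actual Xen definitions. */
    #include <stdio.h>

    #define PAGE_SIZE              4096UL
    #define MAPCACHE_VCPU_ENTRIES  16UL      /* per the thread */
    #define MAPHASH_ENTRIES        8UL       /* assumed (see "16 to 24") */
    #define MAX_VIRT_CPUS          8192UL    /* assumed */

    int main(void)
    {
        unsigned long entries  = MAPCACHE_VCPU_ENTRIES * MAX_VIRT_CPUS;
        unsigned long va_bytes = entries * PAGE_SIZE;

        /* 131072 entries = 2^17 pages -> the order-17 block mentioned above */
        printf("mapcache entries: %lu\n", entries);
        /* 512 MiB of mappings, before the inuse/garbage bitmaps at the end */
        printf("VA for mappings : %lu MiB\n", va_bytes >> 20);
        /* per vCPU, only this many entries are guaranteed reclaimable,
         * because the hash entries never become garbage */
        printf("reclaimable/vCPU: %lu\n",
               MAPCACHE_VCPU_ENTRIES - MAPHASH_ENTRIES);
        return 0;
    }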
>>> On 12.06.13 at 17:27, Keir Fraser <keir.xen@gmail.com> wrote:
> On 12/06/2013 16:11, "Keir Fraser" <keir.xen@gmail.com> wrote:
> [...]
>
> Could even pull MAPHASH_ENTRIES into config.h and explicitly add it to
> MAPCACHE_VCPU_ENTRIES. That would be neat, would it not? And I believe we
> have space in the mapcache's perdomain slot to enlarge MAPCACHE_VCPU_ENTRIES
> in this way (i.e., from 16 to 24).

Yes, this would be an option, but I very much dislike pulling further stuff
not belonging there into config.h - this define really best lives side by
side with its related definitions. While I just checked the code and think
that this not being a power of two should be fine, I'd nevertheless be
careful with such a change.

Jan
On 12/06/2013 16:48, "Jan Beulich" <JBeulich@suse.com> wrote:>> Why are we so tight on MAPCACHE_VCPU_ENTRIES? Why not say double that number >> and get rid of the accum and the ''replace a hash entry instead'' logic >> instead? We never used to have it, and it''s kind of extra complication and a >> bit gross. > > First of all, doubling the entries is not an argument for dropping > that code - the old 32-bit implementation really would have > needed this too from a theoretical perspective: The number of > readily available (garbage) entries is bounded by > MAPCACHE_VCPU_ENTRIES - MAPHASH_ENTRIES (because the > hash entries actively block getting treated as garbage).So? We have control over both MAPCACHE_VCPU_ENTRUES and MAPHASH_ENTRIES. We can make these somewhat arbitrary constants big
On Wed, Jun 12, 2013 at 6:26 PM, Keir Fraser <keir.xen@gmail.com> wrote:
> On 12/06/2013 16:48, "Jan Beulich" <JBeulich@suse.com> wrote:
> [...]
>
> So? We have control over both MAPCACHE_VCPU_ENTRIES and MAPHASH_ENTRIES. We
> can make these somewhat arbitrary constants big

So is this bug fixed now? Can we close it in the bug tracker?

 -George
>>> On 05.07.13 at 18:56, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jun 12, 2013 at 6:26 PM, Keir Fraser <keir.xen@gmail.com> wrote:
> [...]
>
> So is this bug fixed now? Can we close it in the bug tracker?

The bug itself got fixed, but the number of hash entries didn't get increased
so far (and as previously indicated I'm not intending to do so; Keir indicated
he might).

Jan
close ^
thanks

On Mon, Jul 8, 2013 at 9:58 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 05.07.13 at 18:56, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> [...]
>>
>> So is this bug fixed now? Can we close it in the bug tracker?
>
> The bug itself got fixed, but the number of hash entries didn't
> get increased so far (and as previously indicated I'm not
> intending to do so; Keir indicated he might).
>
> Jan
xen@bugs.xenproject.org
2013-Jul-08 09:15 UTC
Processed: Re: Xen 4.3 + tmem = Xen BUG at domain_page.c:143
Processing commands for xen@bugs.xenproject.org:

> close ^
Command failed: Cannot parse arguments at /srv/xen-devel-bugs/lib/emesinae/control.pl line 325, <M> line 41.
Stop processing here.

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues
close 14
thanks

On Mon, Jul 8, 2013 at 10:07 AM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> close ^
> thanks
>
> On Mon, Jul 8, 2013 at 9:58 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> [...]
>>
>> The bug itself got fixed, but the number of hash entries didn't
>> get increased so far (and as previously indicated I'm not
>> intending to do so; Keir indicated he might).
>>
>> Jan
xen@bugs.xenproject.org
2013-Jul-08 09:30 UTC
Processed: Re: Xen 4.3 + tmem = Xen BUG at domain_page.c:143
Processing commands for xen@bugs.xenproject.org:

> close 14
Closing bug #14

> thanks
Finished processing.

Modified/created Bugs:
- 14: http://bugs.xenproject.org/xen/bug/14

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues