thr3ads.net - Xen devel - dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0

Konrad Rzeszutek Wilk

2012-Sep-04 16:33 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
> Hi Konrad,
>
> This seems to happen only on an Intel machine I'm trying to set up
> as a development machine (haven't seen it on my AMD).
> It boots fine, I have dom0_mem=1024M,max:1024M set, the machine has 2G of
> mem.

Is this only with Xen 4.2? As in, does Xen 4.1 work?

> Dom0 and guest kernel are 3.6.0-rc4 with config:

If you back out:

f393387d160211f60398d58463a7e65
Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date:   Fri Aug 17 16:43:28 2012 -0400

    xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.

Do you see this bug (with either Xen 4.1 or Xen 4.2)?

> [*] Xen memory balloon driver
> [*]   Scrub pages before returning them to system
>
> From http://wiki.xen.org/wiki/Do%EF%BB%BFm0_Memory_%E2%80%94_Where_It_Has_Not_Gone ,
> I thought this should be okay
>
> But when trying to start a PV guest with 512MB mem, the machine (dom0)
> crashes with the stacktrace below (complete serial-log.txt attached).
>
> From the:
> "mapping kernel into physical memory
> about to get started..."
>
> I would almost say it's trying to reload dom0?
> 
> 
> [  897.161119] device vif1.0 entered promiscuous mode
> mapping kernel into physical memory
> about to get started...
> [  897.696619] xen_bridge: port 1(vif1.0) entered forwarding state
> [  897.716219] xen_bridge: port 1(vif1.0) entered forwarding state
> [  898.129465] ------------[ cut here ]------------
> [  898.132209] kernel BUG at drivers/xen/balloon.c:359!
> [  898.132209] invalid opcode: 0000 [#1] PREEMPT SMP

Sander Eikelenboom

2012-Sep-04 16:37 UTC

dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Hi Konrad,

This seems to happen only on an Intel machine I'm trying to set up as a
development machine (haven't seen it on my AMD).
It boots fine, I have dom0_mem=1024M,max:1024M set, the machine has 2G of mem.

Dom0 and guest kernel are 3.6.0-rc4 with config:
[*] Xen memory balloon driver
[*]   Scrub pages before returning them to system

From
http://wiki.xen.org/wiki/Do%EF%BB%BFm0_Memory_%E2%80%94_Where_It_Has_Not_Gone ,
I thought this should be okay.

But when trying to start a PV guest with 512MB mem, the machine (dom0) crashes
with the stacktrace below (complete serial-log.txt attached).

From the:
"mapping kernel into physical memory
about to get started..."

I would almost say it's trying to reload dom0?


[  897.161119] device vif1.0 entered promiscuous mode
mapping kernel into physical memory
about to get started...
[  897.696619] xen_bridge: port 1(vif1.0) entered forwarding state
[  897.716219] xen_bridge: port 1(vif1.0) entered forwarding state
[  898.129465] ------------[ cut here ]------------
[  898.132209] kernel BUG at drivers/xen/balloon.c:359!
[  898.132209] invalid opcode: 0000 [#1] PREEMPT SMP 
[  898.132209] Modules linked in:
[  898.132209] CPU 0 
[  898.132209] Pid: 3338, comm: kworker/0:1 Not tainted 3.6.0-rc4-20120830+ #66 System manufacturer System Product Name/P5Q-EM DO
[  898.132209] RIP: e030:[<ffffffff8133b206>]  [<ffffffff8133b206>] balloon_process+0x336/0x340
[  898.132209] RSP: e02b:ffff880037b4dce0  EFLAGS: 00010213
[  898.132209] RAX: 00000000242b0000 RBX: ffffea0000dfadc0 RCX: 0000000000000000
[  898.132209] RDX: 0000000000037eb7 RSI: 00000000deadbeef RDI: 00000000000000b7
[  898.132209] RBP: ffff880037b4dd40 R08: ffffea0000dfade0 R09: 2222222222222222
[  898.132209] R10: 2222222222222222 R11: 2222222222222222 R12: 0000000000000000
[  898.132209] R13: ffffea0000dfade0 R14: 0000160000000000 R15: 0000000000000001
[  898.132209] FS:  00007fd4bd0ec740(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  898.132209] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  898.132209] CR2: 00007fd4b387d000 CR3: 000000003920a000 CR4: 0000000000042660
[  898.132209] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  898.132209] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  898.132209] Process kworker/0:1 (pid: 3338, threadinfo ffff880037b4c000, task ffff8800398fe180)
[  898.132209] Stack:
[  898.132209]  0000000000037eb7 0000000000000001 ffffffff8286c540 0000000000000001
[  898.132209]  0000000000000000 0000000000007ff0 ffff880037b4dd20 ffffffff81e42a60
[  898.132209]  ffff88003799c6c0 ffff88003fc16700 ffff88003fc0e000 ffff880037b4dd90
[  898.132209] Call Trace:
[  898.132209]  [<ffffffff8107fb8f>] process_one_work+0x1bf/0x4a0
[  898.132209]  [<ffffffff8107fb30>] ? process_one_work+0x160/0x4a0
[  898.132209]  [<ffffffff81849191>] ? __schedule+0x471/0x8a0
[  898.132209]  [<ffffffff8133aed0>] ? decrease_reservation+0x2d0/0x2d0
[  898.132209]  [<ffffffff81080252>] worker_thread+0x152/0x470
[  898.132209]  [<ffffffff8184ad85>] ? _raw_spin_unlock_irqrestore+0x75/0xa0
[  898.132209]  [<ffffffff810ae4dd>] ? trace_hardirqs_on+0xd/0x10
[  898.132209]  [<ffffffff8184ad63>] ? _raw_spin_unlock_irqrestore+0x53/0xa0
[  898.132209]  [<ffffffff81080100>] ? manage_workers+0x290/0x290
[  898.132209]  [<ffffffff81087696>] kthread+0x96/0xa0
[  898.132209]  [<ffffffff8184cb84>] kernel_thread_helper+0x4/0x10
[  898.132209]  [<ffffffff8184b134>] ? retint_restore_args+0x13/0x13
[  898.132209]  [<ffffffff8184cb80>] ? gs_change+0x13/0x13
[  898.132209] Code: ff 0f 1f 40 00 48 89 d8 e9 22 fe ff ff 0f 0b eb fe 48 89 d7 48 89 55 a0 e8 18 e7 cc ff 48 83 f8 ff 48 8b 55 a0 0f 84 74 fe ff ff <0f> 0b eb fe 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 89 d6
[  898.132209] RIP  [<ffffffff8133b206>] balloon_process+0x336/0x340
[  898.132209]  RSP <ffff880037b4dce0>
[  898.738233] ---[ end trace 3f7af50285edb7bb ]---
[  898.749003] BUG: unable to handle kernel paging request at fffffffffffffff8
[  898.752237] IP: [<ffffffff81086eeb>] kthread_data+0xb/0x20
[  898.752237] PGD 1e0d067 PUD 1e0e067 PMD 0 
[  898.752237] Oops: 0000 [#2] PREEMPT SMP 
[  898.752237] Modules linked in:
[  898.752237] CPU 0 
[  898.752237] Pid: 3338, comm: kworker/0:1 Tainted: G      D      3.6.0-rc4-20120830+ #66 System manufacturer System Product Name/P5Q-EM DO
[  898.752237] RIP: e030:[<ffffffff81086eeb>]  [<ffffffff81086eeb>] kthread_data+0xb/0x20
[  898.752237] RSP: e02b:ffff880037b4d898  EFLAGS: 00010082
[  898.752237] RAX: 0000000000000000 RBX: ffff88003fc12e80 RCX: 0000000000000000
[  898.752237] RDX: ffffffff820057a0 RSI: 0000000000000000 RDI: ffff8800398fe180
[  898.752237] RBP: ffff880037b4d898 R08: ffff8800398fe1f0 R09: 0000000000000400
[  898.752237] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[  898.752237] R13: 0000000000000000 R14: ffff880037b4d7b8 R15: ffff880037b4da90
[  898.752237] FS:  00007fd4bd0ec740(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  898.752237] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  898.752237] CR2: fffffffffffffff8 CR3: 000000003920a000 CR4: 0000000000042660
[  898.752237] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  898.752237] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  898.752237] Process kworker/0:1 (pid: 3338, threadinfo ffff880037b4c000, task ffff8800398fe180)
[  898.752237] Stack:
[  898.752237]  ffff880037b4d8c8 ffffffff8108306c 0000000000000000 ffff88003fc12e80
[  898.752237]  0000000000000000 ffff8800398fe528 ffff880037b4da18 ffffffff8184931f
[  898.752237]  0000000000000000 ffffffff81083bb8 ffff8800398fe180 0000000000012e80
[  898.752237] Call Trace:
[  898.752237]  [<ffffffff8108306c>] wq_worker_sleeping+0x1c/0x90
[  898.752237]  [<ffffffff8184931f>] __schedule+0x5ff/0x8a0
[  898.752237]  [<ffffffff81083bb8>] ? free_pid+0x18/0xc0
[  898.752237]  [<ffffffff810602a7>] ? sha1_transform_ssse3+0x187/0xd00
[  898.752237]  [<ffffffff810b1a94>] ? lock_acquire+0xe4/0x110
[  898.752237]  [<ffffffff8106cfa7>] ? do_exit+0x4e7/0x8e0
[  898.752237]  [<ffffffff810e27f2>] ? call_rcu+0x12/0x20
[  898.752237]  [<ffffffff810b1f01>] ? lock_release+0x111/0x260
[  898.752237]  [<ffffffff81849654>] schedule+0x24/0x70
[  898.752237]  [<ffffffff8106d074>] do_exit+0x5b4/0x8e0
[  898.752237]  [<ffffffff81010240>] oops_end+0xb0/0xf0
[  898.752237]  [<ffffffff810103b6>] die+0x56/0x90
[  898.752237]  [<ffffffff8100d6c4>] do_trap+0xc4/0x170
[  898.752237]  [<ffffffff8100dbe2>] ? do_invalid_op+0x72/0xc0
[  898.752237]  [<ffffffff8100dc16>] do_invalid_op+0xa6/0xc0
[  898.752237]  [<ffffffff8133b206>] ? balloon_process+0x336/0x340
[  898.752237]  [<ffffffff810ac9e8>] ? trace_hardirqs_off_caller+0x78/0x150
[  898.752237]  [<ffffffff812b29fd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  898.752237]  [<ffffffff8184b164>] ? restore_args+0x30/0x30
[  898.752237]  [<ffffffff8184c9fb>] invalid_op+0x1b/0x20
[  898.752237]  [<ffffffff8133b206>] ? balloon_process+0x336/0x340
[  898.752237]  [<ffffffff8107fb8f>] process_one_work+0x1bf/0x4a0
[  898.752237]  [<ffffffff8107fb30>] ? process_one_work+0x160/0x4a0
[  898.752237]  [<ffffffff81849191>] ? __schedule+0x471/0x8a0
[  898.752237]  [<ffffffff8133aed0>] ? decrease_reservation+0x2d0/0x2d0
[  898.752237]  [<ffffffff81080252>] worker_thread+0x152/0x470
[  898.752237]  [<ffffffff8184ad85>] ? _raw_spin_unlock_irqrestore+0x75/0xa0
[  898.752237]  [<ffffffff810ae4dd>] ? trace_hardirqs_on+0xd/0x10
[  898.752237]  [<ffffffff8184ad63>] ? _raw_spin_unlock_irqrestore+0x53/0xa0
[  898.752237]  [<ffffffff81080100>] ? manage_workers+0x290/0x290
[  898.752237]  [<ffffffff81087696>] kthread+0x96/0xa0
[  898.752237]  [<ffffffff8184cb84>] kernel_thread_helper+0x4/0x10
[  898.752237]  [<ffffffff8184b134>] ? retint_restore_args+0x13/0x13
[  898.752237]  [<ffffffff8184cb80>] ? gs_change+0x13/0x13
[  898.752237] Code: 55 65 48 8b 04 25 80 c6 00 00 48 8b 80 50 03 00 00 48 89 e5 8b 40 f0 c9 c3 0f 1f 80 00 00 00 00 48 8b 87 50 03 00 00 55 48 89 e5 <48> 8b 40 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
[  898.752237] RIP  [<ffffffff81086eeb>] kthread_data+0xb/0x20
[  898.752237]  RSP <ffff880037b4d898>
[  898.752237] CR2: fffffffffffffff8
[  898.752237] ---[ end trace 3f7af50285edb7bc ]---
[  898.752237] Fixing recursive fault but reboot is needed!
[  912.746625] xen_bridge: port 1(vif1.0) entered forwarding state

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Konrad Rzeszutek Wilk

2012-Sep-04 16:39 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
> Hi Konrad,
>
> This seems to happen only on an Intel machine I'm trying to set up
> as a development machine (haven't seen it on my AMD).
> It boots fine, I have dom0_mem=1024M,max:1024M set, the machine has 2G of
> mem.
>
> Dom0 and guest kernel are 3.6.0-rc4 with config:
> [*] Xen memory balloon driver
> [*]   Scrub pages before returning them to system

Can you also try this patch out and provide the full log (bootup and such)?
Thanks!

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 31ab82f..871a93c 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -355,8 +355,12 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
 		BUG_ON(page == NULL);
 
 		pfn = page_to_pfn(page);
-		BUG_ON(!xen_feature(XENFEAT_auto_translated_physmap) &&
-		       phys_to_machine_mapping_valid(pfn));
+		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+			if (phys_to_machine_mapping_valid(pfn)) {
+				printk(KERN_DEBUG "%lx is %lx!\n", pfn, get_phys_to_machine(pfn));
+				continue;
+			}
+		}
 
 		set_phys_to_machine(pfn, frame_list[i]);
 
@@ -572,6 +576,7 @@ static void __init balloon_add_region(unsigned long start_pfn,
 	 */
 	extra_pfn_end = min(max_pfn, start_pfn + pages);
 
+	printk(KERN_INFO "%s: [%lx->%lx]\n", __func__, start_pfn, extra_pfn_end);
 	for (pfn = start_pfn; pfn < extra_pfn_end; pfn++) {
 		page = pfn_to_page(pfn);
 		/* totalram_pages and totalhigh_pages do not

Sander Eikelenboom

2012-Sep-04 17:19 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 4, 2012, 6:33:47 PM, you wrote:
> On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
>> Hi Konrad,
>>
>> This seems to happen only on an Intel machine I'm trying to set up
>> as a development machine (haven't seen it on my AMD).
>> It boots fine, I have dom0_mem=1024M,max:1024M set, the machine has 2G
>> of mem.
>
> Is this only with Xen 4.2? As in, does Xen 4.1 work?
>
>> Dom0 and guest kernel are 3.6.0-rc4 with config:
>
> If you back out:
> f393387d160211f60398d58463a7e65
> Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date:   Fri Aug 17 16:43:28 2012 -0400
>     xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.
> Do you see this bug? (Either with Xen 4.1 or Xen 4.2)?

With c96aae1f7f393387d160211f60398d58463a7e65 reverted I still see this bug
(with Xen 4.2).

I will use the debug patch you mailed and send back the results ...

Konrad Rzeszutek Wilk

2012-Sep-04 17:58 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Tue, Sep 04, 2012 at 08:02:41PM +0200, Sander Eikelenboom wrote:
>
> Tuesday, September 4, 2012, 6:39:03 PM, you wrote:
>
> > On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
> >> Hi Konrad,
> >>
> >> This seems to happen only on an Intel machine I'm trying
> >> to set up as a development machine (haven't seen it on my AMD).
> >> It boots fine, I have dom0_mem=1024M,max:1024M set, the machine
> >> has 2G of mem.
> >>
> >> Dom0 and guest kernel are 3.6.0-rc4 with config:
> >> [*] Xen memory balloon driver
> >> [*]   Scrub pages before returning them to system
>
> > Can you also try this patch out and provide the full log (bootup and
> > such). Thanks!
>
> After applying this patch and due to the removal of the BUG_ON the domU
> boots and is reachable by SSH.
> Serial log attached.

Wow. That is a lot of .. And if you use Xen 4.1 it works fine?

Sander Eikelenboom

2012-Sep-04 18:02 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 4, 2012, 6:39:03 PM, you wrote:
> On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
>> Hi Konrad,
>>
>> This seems to happen only on an Intel machine I'm trying to set up
>> as a development machine (haven't seen it on my AMD).
>> It boots fine, I have dom0_mem=1024M,max:1024M set, the machine has 2G
>> of mem.
>>
>> Dom0 and guest kernel are 3.6.0-rc4 with config:
>> [*] Xen memory balloon driver
>> [*]   Scrub pages before returning them to system
> Can you also try this patch out and provide the full log (bootup and such).
> Thanks!

After applying this patch, and due to the removal of the BUG_ON, the domU boots
and is reachable by SSH.
Serial log attached.


Ben Guthro

2012-Sep-04 18:07 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

We ran into the same issue in newer kernels, but had not yet
submitted this fix.

One of the developers here came up with a fix (attached, and CC'ed
here) that fixes an issue where the p2m code reuses a structure member
where it shouldn't.
The patch adds a new "old_mfn" member to the gnttab_map_grant_ref
structure, instead of re-using dev_bus_addr.


If this also works for you, I can re-submit it with a Signed-off-by
line, if you prefer, Konrad.

Ben


Konrad Rzeszutek Wilk

2012-Sep-04 18:22 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Tue, Sep 04, 2012 at 02:07:11PM -0400, Ben Guthro wrote:
> We ran into the same issue, in newer kernels - but had not yet
> submitted this fix.
>
> One of the developers here came up with a fix (attached, and CC'ed
> here) that fixes an issue where the p2m code reuses a structure member
> where it shouldn't.
> The patch adds a new "old_mfn" member to the gnttab_map_grant_ref
> structure, instead of re-using dev_bus_addr.

Wow. So that implies the m2p code had some new wonkiness in it.

Perhaps this b9e0d95c041ca2d7ad297ee37c2e9cfab67a188f
or
0930bba674e248b921ea659b036ff02564e5a5f4

both courtesy of Stefano (who is on vacation this week :-())
are at fault?

Would it be possible to revert one of them (or both) and see if the
issues disappear?

Sander Eikelenboom

2012-Sep-04 18:57 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 4, 2012, 8:22:41 PM, you wrote:
> On Tue, Sep 04, 2012 at 02:07:11PM -0400, Ben Guthro wrote:
>> We ran into the same issue, in newer kernels - but had not yet
>> submitted this fix.
>>
>> One of the developers here came up with a fix (attached, and CC'ed
>> here) that fixes an issue where the p2m code reuses a structure member
>> where it shouldn't.
>> The patch adds a new "old_mfn" member to the gnttab_map_grant_ref
>> structure, instead of re-using dev_bus_addr.
> Wow. So that implies the m2p code had some new wonkiness in it.
> Perhaps this b9e0d95c041ca2d7ad297ee37c2e9cfab67a188f
> or
> 0930bba674e248b921ea659b036ff02564e5a5f4
> both courtesy of Stefano (who is on vacation this week :-())
> are at fault?
> Would it be possible to revert one of them (or both) and see if the
> issues disappear?

Reverting b9e0d95c041ca2d7ad297ee37c2e9cfab67a188f didn't help.

Reverting 0930bba674e248b921ea659b036ff02564e5a5f4 didn't work out due
to a lot of merge conflicts :S

Sander Eikelenboom

2012-Sep-04 19:01 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 4, 2012, 7:58:41 PM, you wrote:
> On Tue, Sep 04, 2012 at 08:02:41PM +0200, Sander Eikelenboom wrote:
>>
>> Tuesday, September 4, 2012, 6:39:03 PM, you wrote:
>>
>> > On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
>> >> Hi Konrad,
>> >>
>> >> This seems to happen only on an Intel machine I'm
>> >> trying to set up as a development machine (haven't seen it on my AMD).
>> >> It boots fine, I have dom0_mem=1024M,max:1024M set, the
>> >> machine has 2G of mem.
>> >>
>> >> Dom0 and guest kernel are 3.6.0-rc4 with config:
>> >> [*] Xen memory balloon driver
>> >> [*]   Scrub pages before returning them to system
>>
>> > Can you also try this patch out and provide the full log (bootup
>> > and such). Thanks!
>>
>> After applying this patch and due to the removal of the BUG_ON the domU
>> boots and is reachable by SSH.
>> Serial log attached.
> Wow. That is a lot of .. And if you use Xen 4.1 it works fine?

Uhmm, I don't know. I haven't used this machine for a while, doing
things like writing a master thesis :)
I upgraded Xen and the kernel from a 2.6.36 kernel and Xen 4.0-something to
3.6.0-rc4 and 4.2-rc4.

I'm trying to make it work and to make some Xen patches :-p
But I seem to be stumbling over quite a lot of things with both machines (AMD
and Intel) while going to Xen 4.2 (and from xm to xl).

Sander Eikelenboom

2012-Sep-04 19:34 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 4, 2012, 8:07:11 PM, you wrote:
> We ran into the same issue, in newer kernels - but had not yet
> submitted this fix.
> One of the developers here came up with a fix (attached, and CC'ed
> here) that fixes an issue where the p2m code reuses a structure member
> where it shouldn't.
> The patch adds a new "old_mfn" member to the gnttab_map_grant_ref
> structure, instead of re-using dev_bus_addr.
> If this also works for you, I can re-submit it with a Signed-off-by
> line, if you prefer, Konrad.

Hi Ben,

This patch doesn't work for me:

When starting the PV guest I get:

(XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (68b69070).
(XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (0).
(XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (0).


and from the dom0 kernel:

[  374.425727] BUG: unable to handle kernel paging request at ffff8800fffd9078
[  374.428901] IP: [<ffffffff81336e4e>] gnttab_map_refs+0x14e/0x270
[  374.428901] PGD 1e0c067 PUD 0
[  374.428901] Oops: 0000 [#1] PREEMPT SMP
[  374.428901] Modules linked in:
[  374.428901] CPU 0
[  374.428901] Pid: 4308, comm: qemu-system-i38 Not tainted 3.6.0-rc4-20120830+ #70 System manufacturer System Product Name/P5Q-EM DO
[  374.428901] RIP: e030:[<ffffffff81336e4e>]  [<ffffffff81336e4e>] gnttab_map_refs+0x14e/0x270
[  374.428901] RSP: e02b:ffff88002f185ca8  EFLAGS: 00010206
[  374.428901] RAX: ffff880000000000 RBX: ffff88001471cf00 RCX: 00000000fffd9078
[  374.428901] RDX: 0000000000000050 RSI: 40000000000fffd9 RDI: 00003ffffffff000
[  374.428901] RBP: ffff88002f185d08 R08: 0000000000000078 R09: 0000000000000000
[  374.428901] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
[  374.428901] R13: ffff88001471c480 R14: 0000000000000002 R15: 0000000000000002
[  374.428901] FS:  00007f6def9f2740(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  374.428901] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  374.428901] CR2: ffff8800fffd9078 CR3: 000000002d30e000 CR4: 0000000000042660
[  374.428901] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  374.428901] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  374.428901] Process qemu-system-i38 (pid: 4308, threadinfo ffff88002f184000, task ffff8800376f1040)
[  374.428901] Stack:
[  374.428901]  ffffffffffffffff 0000000000000050 00000000fffd9078 00000000000fffd9
[  374.428901]  0000000001000000 ffff8800382135a0 ffff88002f185d08 ffff880038211960
[  374.428901]  ffff88002f11d2c0 0000000000000004 0000000000000003 0000000000000001
[  374.428901] Call Trace:
[  374.428901]  [<ffffffff8134212e>] gntdev_mmap+0x20e/0x520
[  374.428901]  [<ffffffff8111c502>] ? mmap_region+0x312/0x5a0
[  374.428901]  [<ffffffff810ae0a0>] ? lockdep_trace_alloc+0xa0/0x130
[  374.428901]  [<ffffffff8111c5be>] mmap_region+0x3ce/0x5a0
[  374.428901]  [<ffffffff8111c9e0>] do_mmap_pgoff+0x250/0x350
[  374.428901]  [<ffffffff81109e88>] vm_mmap_pgoff+0x68/0x90
[  374.428901]  [<ffffffff8111a5b2>] sys_mmap_pgoff+0x152/0x170
[  374.428901]  [<ffffffff812b29be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  374.428901]  [<ffffffff81011f29>] sys_mmap+0x29/0x30
[  374.428901]  [<ffffffff8184b939>] system_call_fastpath+0x16/0x1b
[  374.428901] Code: 0f 84 e7 00 00 00 48 89 f1 48 c1 e1 0c 41 81 e0 ff 0f 00 00 48 b8 00 00 00 00 00 88 ff ff 48 bf 00 f0 ff ff ff 3f 00 00 4c 01 c1 <48> 23 3c 01 48 c1 ef 0c 49 8d 54 15 00 4d 85 ed b8 00 00 00 00
[  374.428901] RIP  [<ffffffff81336e4e>] gnttab_map_refs+0x14e/0x270
[  374.428901]  RSP <ffff88002f185ca8>
[  374.428901] CR2: ffff8800fffd9078
[  374.428901] ---[ end trace 0e0a5a49f6503c0a ]---
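
A side note on reading the oops above. Decoded by hand, the bytes on the "Code:" line around the faulting instruction appear to be `mov %rsi,%rcx; shl $12,%rcx; and $0xfff,%r8d; movabs $0xffff880000000000,%rax; ... add %r8,%rcx; and (%rcx,%rax,1),%rdi`. In other words, a frame number in RSI is shifted into an address, a page offset from R8 is added, and the result is dereferenced through the direct map. The following is only a sketch of that arithmetic in plain C, using the register values from the dump; the helper names and constants are made up for illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12
#define TOY_OFFSET_MASK 0xfffULL
/* Base of the kernel direct map as it appears in RAX in the oops. */
#define TOY_DIRECT_MAP  0xffff880000000000ULL

/* Rebuild the virtual address the faulting instruction tried to read. */
static uint64_t fault_address(uint64_t frame, uint64_t offset)
{
    /* Unsigned shift: high bits of the frame value simply fall off. */
    uint64_t phys = (frame << TOY_PAGE_SHIFT) + (offset & TOY_OFFSET_MASK);
    return TOY_DIRECT_MAP + phys;
}

/* Compare the recomputed values with the registers printed in the oops. */
static void check_against_oops(void)
{
    const uint64_t rsi = 0x40000000000fffd9ULL;
    const uint64_t r8  = 0x0000000000000078ULL;
    const uint64_t rcx = 0x00000000fffd9078ULL;
    const uint64_t cr2 = 0xffff8800fffd9078ULL;

    assert((rsi << TOY_PAGE_SHIFT) + (r8 & TOY_OFFSET_MASK) == rcx);
    assert(fault_address(rsi, r8) == cr2);
}
```

The resulting physical offset, 0xfffd9000, sits just below 4 GiB, while the machine is reported to have 2G of memory and dom0 is capped at 1024M. That would be consistent with the direct-map address having no page tables behind it ("PUD 0"), though this is an inference from the dump and not something confirmed in the thread.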



Sander Eikelenboom

2012-Sep-04 20:13 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 4, 2012, 7:58:41 PM, you wrote:
> On Tue, Sep 04, 2012 at 08:02:41PM +0200, Sander Eikelenboom wrote:
>> 
>> Tuesday, September 4, 2012, 6:39:03 PM, you wrote:
>> 
>> > On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
>> >> Hi Konrad,
>> >> 
>> >> This seems to happen only on a intel machine i'm trying to setup as a development machine (haven't seen it on my amd).
>> >> It boots fine, i have dom0_mem=1024M,max:1024M set, the machine has 2G of mem.
>> >> 
>> >> Dom0 and guest kernel are 3.6.0-rc4 with config:
>> >> [*] Xen memory balloon driver
>> >> [*]   Scrub pages before returning them to system
>> 
>> > Can you also try this patch out and provide the full log (bootup and such). Thanks!
>> 
>> After applying this patch and due to the removal of the BUG_ON the domU boots and is reachable by SSH.
>> Serial log attached.

> Wow. That is a lot of .. And if you use Xen 4.1 it works fine?

Ok 3.5.3 crashes as well .. will see what xen 4.1.3 does with both kernels on this machine ...

Robert Phillips

2012-Sep-04 20:27 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Ben,

You have asked me to provide the rationale behind the gnttab_old_mfn patch,
which you emailed to Sander earlier today.
Here are my findings.

I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has changed from our previous version.  It now calls gnttab_map_refs() in drivers/xen/grant-table.c.

That function first calls HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then calls m2p_add_override() in p2m.c, which is where I made my change.

The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.

kmap_op is of type struct gnttab_map_grant_ref.  That data type is used to record grant table mappings so later they can be unmapped correctly.

The problem with saving the old mfn in kmap_op->dev_bus_addr is that it is later overwritten by __gnttab_map_grant_ref() in xen/common/grant_table.c.

Since the storage holding the old mfn got overwritten, the unmapping was being done incorrectly.  The balloon code detected that and bugged at drivers/xen/balloon.c:359.

My patch simply adds another member called old_mfn to struct gnttab_map_grant_ref rather than trying to overload dev_bus_addr.

I don't know if Sander's bug is the same or related.  The BUG_ON at drivers/xen/balloon.c:359 is quite general.  It simply asserts that we are not trying to re-map a valid mapping.

-- Robert Phillips
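
The hazard described above, stashing a value in a field that another party later writes, can be sketched in a few lines of userspace C. This is a toy model only: `struct toy_map_op`, `toy_map_hypercall()` and the two `read_back_*` helpers are hypothetical names invented for this illustration, the struct is deliberately cut down, and the real ordering of calls inside gnttab_map_refs() and m2p_add_override() is not reproduced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for a grant map op; not the real layout. */
struct toy_map_op {
    uint64_t host_addr;
    uint64_t dev_bus_addr; /* output field: the "hypervisor" writes it */
    uint64_t old_mfn;      /* separate slot, as the patch is described */
};

/* Stand-in for the map operation: it owns dev_bus_addr and overwrites it. */
static void toy_map_hypercall(struct toy_map_op *op, uint64_t machine_addr)
{
    op->dev_bus_addr = machine_addr;
}

/* Overloaded variant: stash the old mfn in dev_bus_addr, then map. */
static uint64_t read_back_overloaded(uint64_t old_mfn, uint64_t machine_addr)
{
    struct toy_map_op op = {0};
    op.dev_bus_addr = old_mfn;
    toy_map_hypercall(&op, machine_addr);
    return op.dev_bus_addr; /* what an unmap path would read back */
}

/* Separate-field variant: stash the old mfn in its own member, then map. */
static uint64_t read_back_separate(uint64_t old_mfn, uint64_t machine_addr)
{
    struct toy_map_op op = {0};
    op.old_mfn = old_mfn;
    toy_map_hypercall(&op, machine_addr);
    return op.old_mfn;
}

/* The stashed value is lost in the first case and survives in the second. */
static void toy_selfcheck(void)
{
    assert(read_back_overloaded(0x1234, 0xabcd000) != 0x1234);
    assert(read_back_separate(0x1234, 0xabcd000) == 0x1234);
}
```

Whether this is what actually happens on Sander's machine is a separate question; the sketch only shows why a dedicated member avoids the overwrite in principle.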



Sander Eikelenboom

2012-Sep-04 21:23 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 4, 2012, 7:58:41 PM, you wrote:
> On Tue, Sep 04, 2012 at 08:02:41PM +0200, Sander Eikelenboom wrote:
>> 
>> Tuesday, September 4, 2012, 6:39:03 PM, you wrote:
>> 
>> > On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
>> >> Hi Konrad,
>> >> 
>> >> This seems to happen only on a intel machine i'm trying to setup as a development machine (haven't seen it on my amd).
>> >> It boots fine, i have dom0_mem=1024M,max:1024M set, the machine has 2G of mem.
>> >> 
>> >> Dom0 and guest kernel are 3.6.0-rc4 with config:
>> >> [*] Xen memory balloon driver
>> >> [*]   Scrub pages before returning them to system
>> 
>> > Can you also try this patch out and provide the full log (bootup and such). Thanks!
>> 
>> After applying this patch and due to the removal of the BUG_ON the domU boots and is reachable by SSH.
>> Serial log attached.

> Wow. That is a lot of .. And if you use Xen 4.1 it works fine?

Ok .. to sum it up after today's compile day :-p

- xen-4.2.0-rc4-pre + linux 3.6-rc4 -> BUG_ON on start PV guest
- xen-4.2.0-rc4-pre + linux 3.5.3   -> BUG_ON on start PV guest
- xen-4.1.4-pre     + linux 3.5.3   -> BUG_ON on start PV guest
- xen-4.1.4-pre     + linux 3.4.1   -> Works OK
- xen-4.2.0-rc4-pre + linux 3.6-rc4 -> Works, BUG_ON removed by patch (http://lists.xen.org/archives/html/xen-devel/2012-09/msg00142.html)

Konrad Rzeszutek Wilk

2012-Sep-05 14:06 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
> Ben,
> 
> You have asked me to provide the rationale behind the gnttab_old_mfn patch, which you emailed to Sander earlier today.
> Here are my findings.
> 
> I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has changed from our previous version.  It now calls gnttab_map_refs() in drivers/xen/grant-table.c.
> 
> That function first calls HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then calls m2p_add_override() in p2m.c

And HYPERVISOR_grant_table_op .. would populate map_ops[i].bus_addr with the machine address..

> which is where I made my change.
> 
> The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.
> 
> kmap_op is of type struct gnttab_map_grant_ref.  That data type is used to record grant table mappings so later they can be unmapped correctly.

Right, but the blkback makes a distinction by passing NULL as kmap_op, which means it should use the old mechanism. Meaning that once the hypercall is done, the map_ops[i].bus_addr is not used anymore..

> 
> The problem with saving the old mfn in kmap_op->dev_bus_addr is that it is later overwritten by __gnttab_map_grant_ref() in xen/common/grant_table.c

Uh, so the problem of saving the old mfn in dev_bus_addr has been there for a long long time then?
Even before this patch set?

> 
> Since the storage holding the old mfn got overwritten, the unmapping was being done incorrectly.  The balloon code detected that and bugged at drivers/xen/balloon.c:359
> 

Hmm, I believe the storage for holding the old mfn was/is page->index.

> My patch simply adds another member called old_mfn to struct gnttab_map_grant_ref rather than trying to overload dev_bus_addr.
> 
> I don't know if Sander's bug is the same or related.  The BUG_ON at drivers/xen/balloon.c:359 is quite general.  It simply asserts that we are not trying to re-map a valid mapping.

Right. Somehow he ends up with valid mappings where there should be none. And lots of them.

> 
> -- Robert Phillips
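
One thing that may be worth checking about the old_mfn patch, offered purely as an assumption and not as a diagnosis: struct gnttab_map_grant_ref is also the layout the hypervisor reads, so appending a member on the Linux side changes the element size of any array of map ops passed down. The sketch below models that with two userspace structs. `map_op_abi` follows the public header layout as best I recall it, `map_op_extended` adds a hypothetical trailing `old_mfn`, and all addresses and flag values are arbitrary. A reader that strides by the original size then picks up the low half of the second element's `host_addr` where it expects `flags` (on a little-endian machine), which might be one way to end up with messages like "Bad flags in grant map op".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout a consumer that knows only the original struct would expect. */
struct map_op_abi {
    uint64_t host_addr;
    uint32_t flags;
    uint32_t ref;
    uint16_t dom;
    int16_t  status;
    uint32_t handle;
    uint64_t dev_bus_addr;
};

/* Same fields plus a hypothetical extra member appended by the producer. */
struct map_op_extended {
    uint64_t host_addr;
    uint32_t flags;
    uint32_t ref;
    uint16_t dom;
    int16_t  status;
    uint32_t handle;
    uint64_t dev_bus_addr;
    uint64_t old_mfn;
};

/* Read `flags` of element idx while striding by the original struct size. */
static uint32_t flags_seen_by_abi_reader(const void *ops, size_t idx)
{
    struct map_op_abi view;
    memcpy(&view, (const unsigned char *)ops + idx * sizeof(view), sizeof(view));
    return view.flags;
}

/* Build two extended ops and report what the second one's flags look like. */
static uint32_t second_flags_as_misread(void)
{
    struct map_op_extended ops[2];
    memset(ops, 0, sizeof(ops));
    ops[0].host_addr = 0x00007f00dead9000ULL; /* arbitrary example address */
    ops[0].flags = 0x2;                       /* arbitrary example flags */
    ops[1].host_addr = 0x00007f00deadb000ULL;
    ops[1].flags = 0x2;

    /* Element 0 starts at offset 0 in both layouts, so it reads correctly. */
    assert(flags_seen_by_abi_reader(ops, 0) == 0x2);
    return flags_seen_by_abi_reader(ops, 1);
}
```

If this is what is going on, keeping the saved mfn outside the shared structure (Konrad mentions page->index above) would sidestep the layout change altogether.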

Sander Eikelenboom

2012-Sep-05 14:38 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Wednesday, September 5, 2012, 4:06:01 PM, you wrote:

> On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
>> Ben,
>> 
>> You have asked me to provide the rationale behind the gnttab_old_mfn patch, which you emailed to Sander earlier today.
>> Here are my findings.
>> 
>> I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has changed from our previous version.  It now calls gnttab_map_refs() in drivers/xen/grant-table.c.
>> 
>> That function first calls HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then calls m2p_add_override() in p2m.c

> And HYPERVISOR_grant_table_op .. would populate map_ops[i].bus_addr with the machine address..

>> which is where I made my change.
>> 
>> The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.
>> 
>> kmap_op is of type struct gnttab_map_grant_ref.  That data type is used to record grant table mappings so later they can be unmapped correctly.

> Right, but the blkback makes a distinction by passing NULL as kmap_op, which means it should use the old mechanism. Meaning that once the hypercall is done, the map_ops[i].bus_addr is not used anymore..

>> 
>> The problem with saving the old mfn in kmap_op->dev_bus_addr is that it is later overwritten by __gnttab_map_grant_ref() in xen/common/grant_table.c

> Uh, so the problem of saving the old mfn in dev_bus_addr has been there for a long long time then?
> Even before this patch set?

>> 
>> Since the storage holding the old mfn got overwritten, the unmapping was being done incorrectly.  The balloon code detected that and bugged at drivers/xen/balloon.c:359
>> 

> Hmm, I believe the storage for holding the old mfn was/is page->index.

>> My patch simply adds another member called old_mfn to struct gnttab_map_grant_ref rather than trying to overload dev_bus_addr.
>> 
>> I don't know if Sander's bug is the same or related.  The BUG_ON at drivers/xen/balloon.c:359 is quite general.  It simply asserts that we are not trying to re-map a valid mapping.

> Right. Somehow he ends up with valid mappings where there should be none. And lots of them.

It's something between kernel v3.4.1 and v3.5.3, haven't had time to narrow it down yet.
Any suggestions for specific commits i could try to quickly bisect this one ?


Konrad Rzeszutek Wilk

2012-Sep-05 20:19 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Wed, Sep 05, 2012 at 04:38:48PM +0200, Sander Eikelenboom wrote:
> 
> Wednesday, September 5, 2012, 4:06:01 PM, you wrote:
> 
> > On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
> >> Ben,
> >> 
> >> You have asked me to provide the rationale behind the gnttab_old_mfn patch, which you emailed to Sander earlier today.
> >> Here are my findings.
> >> 
> >> I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has changed from our previous version.  It now calls gnttab_map_refs() in drivers/xen/grant-table.c.
> >> 
> >> That function first calls HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then calls m2p_add_override() in p2m.c
> 
> > And HYPERVISOR_grant_table_op .. would populate map_ops[i].bus_addr with the machine address..
> 
> >> which is where I made my change.
> >> 
> >> The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.
> >> 
> >> kmap_op is of type struct gnttab_map_grant_ref.  That data type is used to record grant table mappings so later they can be unmapped correctly.
> 
> > Right, but the blkback makes a distinction by passing NULL as kmap_op, which means it should use the old mechanism. Meaning that once the hypercall is done, the map_ops[i].bus_addr is not used anymore..
> 
> >> 
> >> The problem with saving the old mfn in kmap_op->dev_bus_addr is that it is later overwritten by __gnttab_map_grant_ref() in xen/common/grant_table.c
> 
> > Uh, so the problem of saving the old mfn in dev_bus_addr has been there for a long long time then?
> > Even before this patch set?
> >> 
> >> Since the storage holding the old mfn got overwritten, the unmapping was being done incorrectly.  The balloon code detected that and bugged at drivers/xen/balloon.c:359
> >> 
> 
> > Hmm, I believe the storage for holding the old mfn was/is page->index.
> 
> 
> >> My patch simply adds another member called old_mfn to struct gnttab_map_grant_ref rather than trying to overload dev_bus_addr.
> >> 
> >> I don't know if Sander's bug is the same or related.  The BUG_ON at drivers/xen/balloon.c:359 is quite general.  It simply asserts that we are not trying to re-map a valid mapping.
> 
> > Right. Somehow he ends up with valid mappings where there should be none. And lots of them.
> 
> It's something between kernel v3.4.1 and v3.5.3, haven't had time to narrow it down yet.
> Any suggestions for specific commits i could try to quickly bisect this one ?

These are the ones that went in:

ea61fc0 xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back.
b9e0d95 xen: mark local pages as FOREIGN in the m2p_override
6878c32 xen/blkfront: Add WARN to deal with misbehaving backends.
5e62625 xen/setup: filter APERFMPERF cpuid feature out
8c9ce60 xen/blkback: Copy id field when doing BLKIF_DISCARD.
58b7b53 xen/balloon: Subtract from xen_released_pages the count that is populated.
780dbcd xen/pci: Check for PCI bridge before using it.
5e152e6 xen/events: Add WARN_ON when quick lookup found invalid type.
5842f57 xen/hvc: Check HVM_PARAM_CONSOLE_[EVTCHN|PFN] for correctness.
a32c88b xen/hvc: Fix error cases around HVM_PARAM_CONSOLE_PFN
2e5ad6b xen/hvc: Collapse error logic.
7664810 xen: do not disable netfront in dom0
68c2c39 xen: do not map the same GSI twice in PVHVM guests.
201a52b hvc_xen: NULL dereference on allocation failure
d79d595 xen: Add selfballoning memory reservation tunable.
d2fb4c5 xenbus: Add support for xenbus backend in stub domain
2f1bd67 xen/smp: unbind irqworkX when unplugging vCPUs.
87e4baa x86/xen/apic: Add missing #include <xen/xen.h>
323f90a xen-acpi-processor: Add missing #include <xen/xen.h>
8605067 xen-blkfront: module exit handling adjustments
e77c78c xen-blkfront: properly name all devices
f62805f xen: enter/exit lazy_mmu_mode around m2p_override calls
211063d xen/acpi/sleep: Enable ACPI sleep via the __acpi_os_prepare_sleep
1ff2b0c xen: implement IRQ_WORK_VECTOR handler
f447d56 xen: implement apic ipi interface
83d51ab xen/setup: update VA mapping when releasing memory during setup
96dc08b xen/setup: Combine the two hypercall functions - since they are quite similar.
2e2fb75 xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
ca11823 xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0
9438ef7 x86/apic: Fix UP boot crash
ab6ec39 xen/apic: implement io apic read with hypercall
27abd14 Revert "xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'"
31b3c9d xen/x86: Implement x86_apic_ops
4a8e2a3 x86/apic: Replace io_apic_ops with x86_io_apic_ops.
977f857 PCI: move mutex locking out of pci_dev_reset function
569ca5b xen/gnttab: add deferred freeing logic
9fe2a70 debugfs: Add support to print u32 array in debugfs
940713b xen/p2m: An early bootup variant of set_phys_to_machine
d509685 xen/p2m: Collapse early_alloc_p2m_middle redundant checks.
cef4cca xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument
3f3aaea xen/p2m: Move code around to allow for better re-usage.

Narrowing this down (so ignore APIC bootup, drivers, etc) these could be it:

b9e0d95 xen: mark local pages as FOREIGN in the m2p_override
58b7b53 xen/balloon: Subtract from xen_released_pages the count that is populated.
d79d595 xen: Add selfballoning memory reservation tunable.
f62805f xen: enter/exit lazy_mmu_mode around m2p_override calls
83d51ab xen/setup: update VA mapping when releasing memory during setup
96dc08b xen/setup: Combine the two hypercall functions - since they are quite similar.
2e2fb75 xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
ca11823 xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0
940713b xen/p2m: An early bootup variant of set_phys_to_machine
d509685 xen/p2m: Collapse early_alloc_p2m_middle redundant checks.
cef4cca xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument
3f3aaea xen/p2m: Move code around to allow for better re-usage.

About nine of them deal with dom0_mem=max ballooning up right, so if you
ignore those:

b9e0d95 xen: mark local pages as FOREIGN in the m2p_override
d79d595 xen: Add selfballoning memory reservation tunable.
f62805f xen: enter/exit lazy_mmu_mode around m2p_override calls

Try reverting any of those.

And if nothing works there then we can try to revert the ones that
deal with 'dom0_mem=max:XX'..

I also need to be able to reproduce this. You said you can only reproduce this
on your Intel box - is this a fast Intel machine? It also looks like you only
have 2GB in the machine - and reserve 1GB for dom0.

If you manually balloon down (so don't start the guest) - say to 512MB -
and then launch a guest, do you see this problem?

Sander Eikelenboom

2012-Sep-05 22:52 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Wednesday, September 5, 2012, 10:19:33 PM, you wrote:
> About nine of them deal with dom0_mem=max ballooning up right, so if you
> ignore those:
> b9e0d95 xen: mark local pages as FOREIGN in the m2p_override
> d79d595 xen: Add selfballoning memory reservation tunable.
> f62805f xen: enter/exit lazy_mmu_mode around m2p_override calls
> Try reverting any of those.
Ah, I missed your email since my hosting provider was down :-(
But anyway, I did a git bisect in the meantime, which leads to:

[f62805f1f30a40e354bd036b4cb799863a39be4b] xen: enter/exit lazy_mmu_mode around m2p_override calls

> And if nothing works there then we can try to revert the ones that
> deal with 'dom0_mem=max:XX'..
> I also need to be able to reproduce this. You said you can only reproduce this
> on your Intel box - is this a fast Intel machine? It also looks like you only
> have 2GB in the machine - and reserve 1GB to the dom0.

The machine is a quad-core Q9400 @ 2.66 GHz, not very fast .. not very slow either

> If you manually (so don't start the guest), balloon down - say to 512MB and then launch
> a guest do you see this problem?

Should I use

xl  mem-max domain-id mem

or

xl  mem-set domain-id mem

for that?


Perhaps a silly question, but why is it ballooning anyway?
I have set dom0's memory and there is enough left to create the domain
... or at least there should be ...

--
Sander

Konrad Rzeszutek Wilk

2012-Sep-06 10:57 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

> > About nine of them deal with dom0_mem=max ballooning up right, so if you
> > ignore those:
> 
> > b9e0d95 xen: mark local pages as FOREIGN in the m2p_override
> > d79d595 xen: Add selfballoning memory reservation tunable.
> > f62805f xen: enter/exit lazy_mmu_mode around m2p_override calls
> 
> > Try reverting any of those.
> 
> Ah i missed your email since my hostingprovider was down :-(
> But anyway done a git bisect in the mean time that leads to:
> 
> [f62805f1f30a40e354bd036b4cb799863a39be4b] xen: enter/exit lazy_mmu_mode around m2p_override calls

OK. Hmm, that will take a bit of thinking to fix.

> 
> > And if nothing works there then we can try to revert the ones that
> > deal with 'dom0_mem=max:XX'..
>
> > I also need to be able to reproduce this. You said you can only reproduce this
> > on your Intel box - is this a fast Intel machine? It also looks like you only
> > have 2GB in the machine - and reserve 1GB to the dom0.
>
> Machine is a quad core q9400 @ 2.66mhz, not very fast .. not very slow either

That is a fast machine. I was thinking you had a Core2 Solo or a Pentium IV Prescott.

>
> > If you manually (so don't start the guest), balloon down - say to 512MB and then launch
> > a guest do you see this problem?
> 
> 
> Should i use
> 
> xl  mem-max domain-id mem
> 
> or
> 
> xl  mem-set domain-id mem

The latter.

>
> for that ?
> 
> 
> Perhaps a silly question, but why is it ballooning anyway ?
> I have set dom0's memory and there is enough left to create the domain ... or at least there should be ...

There was a bug in xl that would autoballoon. You can turn it off in the xl.conf file.
> 
> --
> Sander

Sander Eikelenboom

2012-Sep-06 11:16 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Thursday, September 6, 2012, 12:57:46 PM, you wrote:
>> > About nine of them deal with dom0_mem=max ballooning up right, so if you
>> > ignore those:
>> 
>> > b9e0d95 xen: mark local pages as FOREIGN in the m2p_override
>> > d79d595 xen: Add selfballoning memory reservation tunable.
>> > f62805f xen: enter/exit lazy_mmu_mode around m2p_override calls
>> 
>> > Try reverting any of those.
>> 
>> Ah i missed your email since my hostingprovider was down :-(
>> But anyway done a git bisect in the mean time that leads to:
>> 
>> [f62805f1f30a40e354bd036b4cb799863a39be4b] xen: enter/exit lazy_mmu_mode around m2p_override calls
> OK. Hmm.that will take a bit of thinking to fix.
>> 
>> 
>> > And if nothing works there then we can try to revert the ones that
>> > deal with 'dom0_mem=max:XX'..
>>
>> > I also need to be able to reproduce this. You said you can only reproduce this
>> > on your Intel box - is this a fast Intel machine? It also looks like you only
>> > have 2GB in the machine - and reserve 1GB to the dom0.
>>
>> Machine is a quad core q9400 @ 2.66mhz, not very fast .. not very slow either
> That is a fast machine. I was thinking you had a Core2 Solo or a Pentium IV Prescott.
>>
>> > If you manually (so don't start the guest), balloon down - say to 512MB and then launch
>> > a guest do you see this problem?
>> 
>> 
>> Should i use
>> 
>> xl  mem-max domain-id mem
>> 
>> or
>> 
>> xl  mem-set domain-id mem
Will test that shortly
> The later.
>> 
>> for that ?
>> 
>> 
>> Perhaps a silly question, but why is it ballooning anyway ?
>> I have set dom0's memory and there is enough left to create the domain ... or at least there should be ...
> There was a bug in xl that would autoballoon. You can turn it off using some xl.conf file.

Should that have been fixed?
       A) You describe it as a bug, so even without tinkering with the default xl.conf, the tools shouldn't be autoballooning when "dom0_mem=X, max:X" is set, right?
       B) Or was it fixed by letting the user turn it off in xl.conf?

If A, that would mean the "bug" is still present in xen-4.2-rc4 ...

Because I was under the impression that "dom0_mem=X, max:X" would prevent the whole autoballooning stuff :-)

>> 
>> --
>> Sander

Sander Eikelenboom

2012-Sep-06 16:46 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Thursday, September 6, 2012, 12:57:46 PM, you wrote:
>> > About nine of them deal with dom0_mem=max ballooning up right, so if you
>> > ignore those:
>> 
>> > b9e0d95 xen: mark local pages as FOREIGN in the m2p_override
>> > d79d595 xen: Add selfballoning memory reservation tunable.
>> > f62805f xen: enter/exit lazy_mmu_mode around m2p_override calls
>> 
>> > Try reverting any of those.
>> 
>> Ah i missed your email since my hostingprovider was down :-(
>> But anyway done a git bisect in the mean time that leads to:
>> 
>> [f62805f1f30a40e354bd036b4cb799863a39be4b] xen: enter/exit lazy_mmu_mode around m2p_override calls
> OK. Hmm.that will take a bit of thinking to fix.
>> 
>> 
>> > And if nothing works there then we can try to revert the ones that
>> > deal with 'dom0_mem=max:XX'..
>>
>> > I also need to be able to reproduce this. You said you can only reproduce this
>> > on your Intel box - is this a fast Intel machine? It also looks like you only
>> > have 2GB in the machine - and reserve 1GB to the dom0.
>>
>> Machine is a quad core q9400 @ 2.66mhz, not very fast .. not very slow either
> That is a fast machine. I was thinking you had a Core2 Solo or a Pentium IV Prescott.
>>
>> > If you manually (so don't start the guest), balloon down - say to 512MB and then launch
>> > a guest do you see this problem?
>> 
>> 
>> Should i use
>> 
>> xl  mem-max domain-id mem
>> 
>> or
>> 
>> xl  mem-set domain-id mem
> The later.
OK, I tested that as well:
   - After "xl  mem-set 0 768M", xentop reports the new mem values (free and for dom0) correctly; nothing else happens.
   - After a short wait, I tried to start a guest, and dom0 crashes in the ballooning code as before.
>> 
>> for that ?
>> 
>> 
>> Perhaps a silly question, but why is it ballooning anyway ?
>> I have set dom0's memory and there is enough left to create the domain ... or at least there should be ...
> There was a bug in xl that would autoballoon. You can turn it off using some xl.conf file.
>> 
>> --
>> Sander

Stefano Stabellini

2012-Sep-11 16:02 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Wed, 5 Sep 2012, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
> > Ben,
> >
> > You have asked me to provide the rationale behind the gnttab_old_mfn patch, which you emailed to Sander earlier today.
> > Here are my findings.
> >
> > I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has changed from our previous version.  It now calls gnttab_map_refs() in drivers/xen/grant-table.c.
> >
> > That function first calls HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then calls m2p_add_override() in p2m.c
>
> And HYPERVISOR_grant_table_op .. would populate map_ops[i].bus_addr with the machine address..
>
> > which is where I made my change.
> >
> > The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.
> >
> > kmap_op is of type struct gnttab_map_grant_ref.  That data type is used to record grant table mappings so later they can be unmapped correctly.
>
> Right, but the blkback makes a distinction by passing NULL as kmap_op, which means it should
> use the old mechanism. Meaning that once the hypercall is done, the map_ops[i].bus_addr is not
> used anymore..
>
> >
> > The problem with saving the old mfn in kmap_op->dev_bus_addr is that it is later overwritten by __gnttab_map_grant_ref() in xen/common/grant_table.c
>
> Uh, so the problem of saving the old mfn in dev_bus_addr has been there for a long long time then?
> Even before this patch set?

I think that Robert identified the real problem: dev_bus_addr shouldn't
have been used here. However, the bug only shows up if we are batching
the grant table operations, which we started doing in
f62805f1f30a40e354bd036b4cb799863a39be4b.
That's why Sander's bisection found that
f62805f1f30a40e354bd036b4cb799863a39be4b is the culprit.

However, the fix is incorrect because it modifies a struct that is
part of the Xen ABI.
I am appending an alternative fix that doesn't need any changes to
public headers.

Sander, could you please let me know if it fixes the problem for you?
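
To make the ordering problem concrete, here is a minimal user-space sketch in C.
It is a toy model and not kernel code: struct toy_map_op, struct toy_page,
toy_queue_map(), toy_flush() and TOY_MACHINE_ADDR are all invented for
illustration, and it assumes (based on the description above) that a batched map
operation writes dev_bus_addr only when the batch is flushed. The first function
stashes the old mfn in dev_bus_addr; the second leaves it in page->index, which
is the approach the patch below takes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct gnttab_map_grant_ref: only the field under discussion. */
struct toy_map_op {
	uint64_t dev_bus_addr;	/* output field: written when the map operation completes */
};

/* Toy stand-in for struct page: index holds the saved old mfn. */
struct toy_page {
	uintptr_t index;
};

#define TOY_BATCH_MAX 4
#define TOY_MACHINE_ADDR 0xabcd000ULL	/* arbitrary value the simulated hypervisor returns */

static struct toy_map_op *toy_batch[TOY_BATCH_MAX];
static size_t toy_batch_len;

/* Queue a map operation instead of executing it (models batched mode). */
static void toy_queue_map(struct toy_map_op *op)
{
	assert(toy_batch_len < TOY_BATCH_MAX);
	toy_batch[toy_batch_len++] = op;
}

/* Execute everything queued: the simulated hypervisor fills in dev_bus_addr. */
static void toy_flush(void)
{
	for (size_t i = 0; i < toy_batch_len; i++)
		toy_batch[i]->dev_bus_addr = TOY_MACHINE_ADDR;
	toy_batch_len = 0;
}

/*
 * Scheme being removed: stash the old mfn in op->dev_bus_addr and point
 * page->index at the op. Returns the mfn that unmap would later restore.
 */
uintptr_t toy_restore_via_dev_bus_addr(uintptr_t old_mfn, int batched)
{
	struct toy_map_op op = { 0 };
	struct toy_page page = { .index = old_mfn };

	toy_queue_map(&op);
	if (!batched)
		toy_flush();		/* operation completes before the stash */

	op.dev_bus_addr = page.index;	/* stash the old mfn in the output field */
	page.index = (uintptr_t)&op;

	toy_flush();			/* batched: completion lands after the stash */

	return (uintptr_t)((struct toy_map_op *)page.index)->dev_bus_addr;
}

/* Scheme in the patch: leave the old mfn in page->index, pass the op separately. */
uintptr_t toy_restore_via_page_index(uintptr_t old_mfn, int batched)
{
	struct toy_map_op op = { 0 };
	struct toy_page page = { .index = old_mfn };

	toy_queue_map(&op);
	if (!batched)
		toy_flush();
	toy_flush();			/* whenever it completes, page.index is untouched */

	return page.index;
}
```

Under this model, toy_restore_via_dev_bus_addr(0x1234, 1) returns
TOY_MACHINE_ADDR rather than 0x1234, while the unbatched call and both
toy_restore_via_page_index() calls return 0x1234.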

---


diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 93971e8..472b9b7 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -51,7 +51,8 @@ extern unsigned long set_phys_range_identity(unsigned long pfn_s,
 
 extern int m2p_add_override(unsigned long mfn, struct page *page,
 			    struct gnttab_map_grant_ref *kmap_op);
-extern int m2p_remove_override(struct page *page, bool clear_pte);
+extern int m2p_remove_override(struct page *page,
+				struct gnttab_map_grant_ref *kmap_op);
 extern struct page *m2p_find_override(unsigned long mfn);
 extern unsigned long m2p_find_override_pfn(unsigned long mfn, unsigned long pfn);
 
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 64effdc..2825594 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -734,9 +734,6 @@ int m2p_add_override(unsigned long mfn, struct page *page,
 
 			xen_mc_issue(PARAVIRT_LAZY_MMU);
 		}
-		/* let's use dev_bus_addr to record the old mfn instead */
-		kmap_op->dev_bus_addr = page->index;
-		page->index = (unsigned long) kmap_op;
 	}
 	spin_lock_irqsave(&m2p_override_lock, flags);
 	list_add(&page->lru,  &m2p_overrides[mfn_hash(mfn)]);
@@ -763,7 +760,8 @@ int m2p_add_override(unsigned long mfn, struct page *page,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(m2p_add_override);
-int m2p_remove_override(struct page *page, bool clear_pte)
+int m2p_remove_override(struct page *page,
+		struct gnttab_map_grant_ref *kmap_op)
 {
 	unsigned long flags;
 	unsigned long mfn;
@@ -793,10 +791,8 @@ int m2p_remove_override(struct page *page, bool clear_pte)
 	WARN_ON(!PagePrivate(page));
 	ClearPagePrivate(page);
 
-	if (clear_pte) {
-		struct gnttab_map_grant_ref *map_op =
-			(struct gnttab_map_grant_ref *) page->index;
-		set_phys_to_machine(pfn, map_op->dev_bus_addr);
+	set_phys_to_machine(pfn, page->index);
+	if (kmap_op != NULL) {
 		if (!PageHighMem(page)) {
 			struct multicall_space mcs;
 			struct gnttab_unmap_grant_ref *unmap_op;
@@ -808,13 +804,13 @@ int m2p_remove_override(struct page *page, bool clear_pte)
 			 * issued. In this case handle is going to -1 because
 			 * it hasn't been modified yet.
 			 */
-			if (map_op->handle == -1)
+			if (kmap_op->handle == -1)
 				xen_mc_flush();
 			/*
-			 * Now if map_op->handle is negative it means that the
+			 * Now if kmap_op->handle is negative it means that the
 			 * hypercall actually returned an error.
 			 */
-			if (map_op->handle == GNTST_general_error) {
+			if (kmap_op->handle == GNTST_general_error) {
 				printk(KERN_WARNING "m2p_remove_override: "
 						"pfn %lx mfn %lx, failed to modify kernel mappings",
 						pfn, mfn);
@@ -824,8 +820,8 @@ int m2p_remove_override(struct page *page, bool clear_pte)
 			mcs = xen_mc_entry(
 					sizeof(struct gnttab_unmap_grant_ref));
 			unmap_op = mcs.args;
-			unmap_op->host_addr = map_op->host_addr;
-			unmap_op->handle = map_op->handle;
+			unmap_op->host_addr = kmap_op->host_addr;
+			unmap_op->handle = kmap_op->handle;
 			unmap_op->dev_bus_addr = 0;
 
 			MULTI_grant_table_op(mcs.mc,
@@ -836,10 +832,9 @@ int m2p_remove_override(struct page *page, bool clear_pte)
 			set_pte_at(&init_mm, address, ptep,
 					pfn_pte(pfn, PAGE_KERNEL));
 			__flush_tlb_single(address);
-			map_op->host_addr = 0;
+			kmap_op->host_addr = 0;
 		}
-	} else
-		set_phys_to_machine(pfn, page->index);
+	}
 
 	/* p2m(m2p(mfn)) == FOREIGN_FRAME(mfn): the mfn is already present
 	 * somewhere in this domain, even before being added to the
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 73f196c..c6decb9 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -337,7 +337,7 @@ static void xen_blkbk_unmap(struct pending_req *req)
 		invcount++;
 	}
 
-	ret = gnttab_unmap_refs(unmap, pages, invcount, false);
+	ret = gnttab_unmap_refs(unmap, NULL, pages, invcount);
 	BUG_ON(ret);
 }
 
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 1ffd03b..7f12416 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -314,8 +314,9 @@ static int __unmap_grant_pages(struct grant_map *map, int offset, int pages)
 		}
 	}
 
-	err = gnttab_unmap_refs(map->unmap_ops + offset, map->pages + offset,
-				pages, true);
+	err = gnttab_unmap_refs(map->unmap_ops + offset,
+			use_ptemod ? map->kmap_ops + offset : NULL, map->pages + offset,
+			pages);
 	if (err)
 		return err;
 
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 0bfc1ef..0067266 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -870,7 +870,8 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 EXPORT_SYMBOL_GPL(gnttab_map_refs);
 
 int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
-		      struct page **pages, unsigned int count, bool clear_pte)
+		      struct gnttab_map_grant_ref *kmap_ops,
+		      struct page **pages, unsigned int count)
 {
 	int i, ret;
 	bool lazy = false;
@@ -888,7 +889,8 @@ int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
 	}
 
 	for (i = 0; i < count; i++) {
-		ret = m2p_remove_override(pages[i], clear_pte);
+		ret = m2p_remove_override(pages[i], kmap_ops ?
+				       &kmap_ops[i] : NULL);
 		if (ret)
 			return ret;
 	}
diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 11e27c3..f19fff8 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -187,6 +187,7 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 		    struct gnttab_map_grant_ref *kmap_ops,
 		    struct page **pages, unsigned int count);
 int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
-		      struct page **pages, unsigned int count, bool clear_pte);
+		      struct gnttab_map_grant_ref *kunmap_ops,
+		      struct page **pages, unsigned int count);
 
 #endif /* __ASM_GNTTAB_H__ */
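
The calling convention introduced by the patch above can also be sketched in
user-space C. This is again a toy model under stated assumptions: every toy_*
name is hypothetical, and the real m2p_remove_override() also issues an unmap
multicall and rewrites the PTE, which is reduced here to clearing host_addr. It
shows only that the old mfn is always restored from page->index, and that a NULL
kmap_ops array (as blkback passes) skips the kernel-mapping teardown.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel-side map op kept by gntdev. */
struct toy_kmap {
	int handle;
	unsigned long host_addr;
};

/* Toy stand-in for a granted page: index is the saved old mfn. */
struct toy_gpage {
	unsigned long index;	/* old mfn, saved when the override was added */
	unsigned long mfn;	/* current p2m entry for this page */
};

/* Returns 1 if a kernel mapping was torn down, 0 otherwise. */
int toy_remove_override(struct toy_gpage *page, struct toy_kmap *kmap_op)
{
	/* Always restore the p2m entry from page->index. */
	page->mfn = page->index;

	/* Only callers that supplied a kernel map op have a mapping to undo. */
	if (kmap_op != NULL) {
		kmap_op->host_addr = 0;
		return 1;
	}
	return 0;
}

/* Mirrors the per-element dispatch: kmap_ops ? &kmap_ops[i] : NULL. */
int toy_unmap_refs(struct toy_kmap *kmap_ops, struct toy_gpage *pages,
		   size_t count)
{
	int torn_down = 0;

	assert(pages != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		torn_down += toy_remove_override(&pages[i],
						 kmap_ops ? &kmap_ops[i] : NULL);
	return torn_down;
}
```

Passing NULL for kmap_ops restores every page's mfn and tears down nothing;
passing an array restores the mfns and clears host_addr in each element.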

Sander Eikelenboom

2012-Sep-12 10:28 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Tuesday, September 11, 2012, 6:02:47 PM, you wrote:
> On Wed, 5 Sep 2012, Konrad Rzeszutek Wilk wrote:
>> On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
>> > Ben,
>> > 
>> > You have asked me to provide the rationale behind the gnttab_old_mfn patch, which you emailed to Sander earlier today.
>> > Here are my findings.
>> >
>> > I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has changed from our previous version.  It now calls gnttab_map_refs() in drivers/xen/grant-table.c.
>> >
>> > That function first calls HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then calls m2p_add_override() in p2m.c
>>
>> And HYPERVISOR_grant_table_op .. would populate map_ops[i].bus_addr with the machine address..
>>
>> > which is where I made my change.
>> >
>> > The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.
>> >
>> > kmap_op is of type struct gnttab_map_grant_ref.  That data type is used to record grant table mappings so later they can be unmapped correctly.
>>
>> Right, but the blkback makes a distinction by passing NULL as kmap_op, which means it should
>> use the old mechanism. Meaning that once the hypercall is done, the map_ops[i].bus_addr is not
>> used anymore..
>>
>> >
>> > The problem with saving the old mfn in kmap_op->dev_bus_addr is that it is later overwritten by __gnttab_map_grant_ref() in xen/common/grant_table.c
>>
>> Uh, so the problem of saving the old mfn in dev_bus_addr has been there for a long long time then?
>> Even before this patch set?
> I think that Robert identified the real problem: dev_bus_addr shouldn't
> have been used here. However the bug only shows up if we are batching
> the grant table operations, that we started doing since
> f62805f1f30a40e354bd036b4cb799863a39be4b.
> That's why Sander's bisection found that
> f62805f1f30a40e354bd036b4cb799863a39be4b is the culprit.
> However the fix is incorrect because it is modifying a struct that is
> part of the Xen ABI.
> I am appending an alternative fix that doesn't need any changes to
> public headers.
> Sander, could you please let me know if it fixes the problem for you?
It does !

Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
> ---
> diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
> index 93971e8..472b9b7 100644
> --- a/arch/x86/include/asm/xen/page.h
> +++ b/arch/x86/include/asm/xen/page.h
> @@ -51,7 +51,8 @@ extern unsigned long set_phys_range_identity(unsigned long pfn_s,
>  
>  extern int m2p_add_override(unsigned long mfn, struct page *page,
>                             struct gnttab_map_grant_ref *kmap_op);
> -extern int m2p_remove_override(struct page *page, bool clear_pte);
> +extern int m2p_remove_override(struct page *page,
> +                               struct gnttab_map_grant_ref *kmap_op);
>  extern struct page *m2p_find_override(unsigned long mfn);
>  extern unsigned long m2p_find_override_pfn(unsigned long mfn, unsigned long pfn);
>  
> diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
> index 64effdc..2825594 100644
> --- a/arch/x86/xen/p2m.c
> +++ b/arch/x86/xen/p2m.c
> @@ -734,9 +734,6 @@ int m2p_add_override(unsigned long mfn, struct page *page,
>  
>                         xen_mc_issue(PARAVIRT_LAZY_MMU);
>                 }
> -               /* let's use dev_bus_addr to record the old mfn instead */
> -               kmap_op->dev_bus_addr = page->index;
> -               page->index = (unsigned long) kmap_op;
>         }
>         spin_lock_irqsave(&m2p_override_lock, flags);
>         list_add(&page->lru,  &m2p_overrides[mfn_hash(mfn)]);
> @@ -763,7 +760,8 @@ int m2p_add_override(unsigned long mfn, struct page *page,
>         return 0;
>  }
>  EXPORT_SYMBOL_GPL(m2p_add_override);
> -int m2p_remove_override(struct page *page, bool clear_pte)
> +int m2p_remove_override(struct page *page,
> +               struct gnttab_map_grant_ref *kmap_op)
>  {
>         unsigned long flags;
>         unsigned long mfn;
> @@ -793,10 +791,8 @@ int m2p_remove_override(struct page *page, bool clear_pte)
>         WARN_ON(!PagePrivate(page));
>         ClearPagePrivate(page);
>  
> -       if (clear_pte) {
> -               struct gnttab_map_grant_ref *map_op =
> -                       (struct gnttab_map_grant_ref *) page->index;
> -               set_phys_to_machine(pfn, map_op->dev_bus_addr);
> +       set_phys_to_machine(pfn, page->index);
> +       if (kmap_op != NULL) {
>                 if (!PageHighMem(page)) {
>                         struct multicall_space mcs;
>                         struct gnttab_unmap_grant_ref *unmap_op;
> @@ -808,13 +804,13 @@ int m2p_remove_override(struct page *page, bool clear_pte)
>                          * issued. In this case handle is going to -1 because
>                          * it hasn't been modified yet.
>                          */
> -                       if (map_op->handle == -1)
> +                       if (kmap_op->handle == -1)
>                                 xen_mc_flush();
>                         /*
> -                        * Now if map_op->handle is negative it means that the
> +                        * Now if kmap_op->handle is negative it means that the
>                          * hypercall actually returned an error.
>                          */
> -                       if (map_op->handle == GNTST_general_error) {
> +                       if (kmap_op->handle == GNTST_general_error) {
>                                 printk(KERN_WARNING "m2p_remove_override: "
>                                                 "pfn %lx mfn %lx, failed to modify kernel mappings",
>                                                 pfn, mfn);
> @@ -824,8 +820,8 @@ int m2p_remove_override(struct page *page, bool clear_pte)
>                         mcs = xen_mc_entry(
>                                         sizeof(struct gnttab_unmap_grant_ref));
>                         unmap_op = mcs.args;
> -                       unmap_op->host_addr = map_op->host_addr;
> -                       unmap_op->handle = map_op->handle;
> +                       unmap_op->host_addr = kmap_op->host_addr;
> +                       unmap_op->handle = kmap_op->handle;
>                         unmap_op->dev_bus_addr = 0;
>  
>                         MULTI_grant_table_op(mcs.mc,
> @@ -836,10 +832,9 @@ int m2p_remove_override(struct page *page, bool clear_pte)
>                         set_pte_at(&init_mm, address, ptep,
>                                         pfn_pte(pfn, PAGE_KERNEL));
>                         __flush_tlb_single(address);
> -                       map_op->host_addr = 0;
> +                       kmap_op->host_addr = 0;
>                 }
> -       } else
> -               set_phys_to_machine(pfn, page->index);
> +       }
>  
>         /* p2m(m2p(mfn)) == FOREIGN_FRAME(mfn): the mfn is already present
>          * somewhere in this domain, even before being added to the
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 73f196c..c6decb9 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -337,7 +337,7 @@ static void xen_blkbk_unmap(struct pending_req *req)
>                 invcount++;
>         }
>  
> -       ret = gnttab_unmap_refs(unmap, pages, invcount, false);
> +       ret = gnttab_unmap_refs(unmap, NULL, pages, invcount);
>         BUG_ON(ret);
>  }
>  
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index 1ffd03b..7f12416 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -314,8 +314,9 @@ static int __unmap_grant_pages(struct grant_map *map, int offset, int pages)
>                 }
>         }
>  
> -       err = gnttab_unmap_refs(map->unmap_ops + offset, map->pages + offset,
> -                               pages, true);
> +       err = gnttab_unmap_refs(map->unmap_ops + offset,
> +                       use_ptemod ? map->kmap_ops + offset : NULL, map->pages + offset,
> +                       pages);
>         if (err)
>                 return err;
>  
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index 0bfc1ef..0067266 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -870,7 +870,8 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
>  EXPORT_SYMBOL_GPL(gnttab_map_refs);
>  
>  int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
> -                     struct page **pages, unsigned int count, bool clear_pte)
> +                     struct gnttab_map_grant_ref *kmap_ops,
> +                     struct page **pages, unsigned int count)
>  {
>         int i, ret;
>         bool lazy = false;
> @@ -888,7 +889,8 @@ int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
>         }
>  
>         for (i = 0; i < count; i++) {
> -               ret = m2p_remove_override(pages[i], clear_pte);
> +               ret = m2p_remove_override(pages[i], kmap_ops ?
> +                                      &kmap_ops[i] : NULL);
>                 if (ret)
>                         return ret;
>         }
> diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
> index 11e27c3..f19fff8 100644
> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -187,6 +187,7 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
>                     struct gnttab_map_grant_ref *kmap_ops,
>                     struct page **pages, unsigned int count);
>  int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
> -                     struct page **pages, unsigned int count, bool clear_pte);
> +                     struct gnttab_map_grant_ref *kunmap_ops,
> +                     struct page **pages, unsigned int count);
>  
>  #endif /* __ASM_GNTTAB_H__ */
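
The calling convention introduced by the quoted patch can be illustrated with a small user-space model. This is only a sketch: `struct map_op`, `struct page_stub`, `remove_override()`, `unmap_refs()` and the `kernel_unmaps` counter are hypothetical stand-ins, not the kernel's types or functions. It assumes, as the diff suggests, that the p2m entry is always restored from `page->index` and that the kernel mapping is torn down only when a per-page `kmap_op` is supplied.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * needed for this illustration are modelled. */
struct map_op {
    unsigned long host_addr;
    int handle;
};

struct page_stub {
    unsigned long index;    /* holds the saved old mfn */
};

/* Counts how many kernel mappings were torn down (illustration only). */
static int kernel_unmaps;

/* Shape of the patched m2p_remove_override(): the old mfn always comes
 * from page->index; the kernel mapping is handled only if kmap_op is
 * non-NULL. */
static unsigned long remove_override(struct page_stub *page,
                                     struct map_op *kmap_op)
{
    unsigned long restored_mfn = page->index;

    if (kmap_op != NULL) {
        kernel_unmaps++;
        kmap_op->host_addr = 0;
    }
    return restored_mfn;
}

/* Shape of the loop in the patched gnttab_unmap_refs(): a NULL array
 * means "no kernel mappings", otherwise each page gets its own op. */
static void unmap_refs(struct map_op *kmap_ops, struct page_stub *pages,
                       unsigned long *restored, unsigned int count)
{
    for (unsigned int i = 0; i < count; i++)
        restored[i] = remove_override(&pages[i],
                                      kmap_ops ? &kmap_ops[i] : NULL);
}
```

In this model a blkback-style caller passes NULL for `kmap_ops`, while a gntdev-style caller that uses PTE modification passes its own array, which mirrors the two call sites changed in the diff.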

Stefano Stabellini

2012-Sep-12 11:28 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Wed, 12 Sep 2012, Sander Eikelenboom wrote:
> Tuesday, September 11, 2012, 6:02:47 PM, you wrote:
> 
> > On Wed, 5 Sep 2012, Konrad Rzeszutek Wilk wrote:
> >> On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
> >> > Ben,
> >> > 
> >> > You have asked me to provide the rationale behind the
> >> > gnttab_old_mfn patch, which you emailed to Sander earlier today.
> >> > Here are my findings.
> >> > 
> >> > I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c
> >> > has changed from our previous version.  It now calls
> >> > gnttab_map_refs() in drivers/xen/grant-table.c.
> >> > 
> >> > That function first calls
> >> > HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then
> >> > calls m2p_add_override() in p2m.c
> >> 
> >> And HYPERVISOR_grant_table_op .. would populate map_ops[i].bus_addr
> >> with the machine address..
> >> 
> >> > which is where I made my change.
> >> > 
> >> > The unpatched code was saving the pfn's old mfn in
> >> > kmap_op->dev_bus_addr.
> >> > 
> >> > kmap_op is of type struct gnttab_map_grant_ref.  That data type is
> >> > used to record grant table mappings so later they can be unmapped
> >> > correctly.
> >> 
> >> Right, but the blkback makes a distinction by passing NULL as
> >> kmap_op, which means it should use the old mechanism. Meaning that
> >> once the hypercall is done, the map_ops[i].bus_addr is not used
> >> anymore..
> >> 
> >> > 
> >> > The problem with saving the old mfn in kmap_op->dev_bus_addr is
> >> > that it is later overwritten by __gnttab_map_grant_ref() in
> >> > xen/common/grant_table.c
> >> 
> >> Uh, so the problem of saving the old mfn in dev_bus_addr has been
> >> there for a long long time then?
> >> Even before this patch set?
> 
> > I think that Robert identified the real problem: dev_bus_addr shouldn't
> > have been used here. However the bug only shows up if we are batching
> > the grant table operations, that we started doing since
> > f62805f1f30a40e354bd036b4cb799863a39be4b.
> > That's why Sander's bisection found that
> > f62805f1f30a40e354bd036b4cb799863a39be4b is the culprit.
> 
> > However the fix is incorrect because it is modifying a struct that is
> > part of the Xen ABI.
> > I am appending an alternative fix that doesn't need any changes to
> > public headers.
> 
> > Sander, could you please let me know if it fixes the problem for you?
> 
> It does !
> 
> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
> 
Thanks for testing!
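
The failure mode discussed above can be sketched as a simplified user-space model. It rests on assumptions: the structures and the functions `hypervisor_complete()`, `restore_old_scheme()` and `restore_new_scheme()` are hypothetical, and the ordering (the batched map operation completing after the old mfn was stashed) is taken from the explanation in this thread rather than from the kernel source.

```c
#include <stdint.h>

/* Hypothetical stand-ins; only the fields relevant to the bug. */
struct map_op {
    uintptr_t dev_bus_addr;   /* output field written by the hypervisor */
};

struct page_stub {
    uintptr_t index;          /* kernel uses unsigned long; uintptr_t here
                               * so the pointer cast below is portable */
};

/* Models the hypervisor completing a (possibly batched) map operation:
 * it writes its own result into dev_bus_addr. */
static void hypervisor_complete(struct map_op *op, uintptr_t machine_addr)
{
    op->dev_bus_addr = machine_addr;
}

/* Old scheme: the old mfn is stashed in the op, and page->index is
 * reused to remember the op. With batching, the hypercall completes
 * after the stash and clobbers it. Returns what removal would restore. */
uintptr_t restore_old_scheme(uintptr_t old_mfn, uintptr_t machine_addr)
{
    struct map_op op = { 0 };
    struct page_stub page = { old_mfn };

    op.dev_bus_addr = page.index;            /* stash old mfn in the op */
    page.index = (uintptr_t)&op;             /* page now points at the op */
    hypervisor_complete(&op, machine_addr);  /* deferred op runs later */

    return ((struct map_op *)page.index)->dev_bus_addr;
}

/* New scheme: page->index keeps the old mfn, and the op is handed to
 * the removal path explicitly, so the hypervisor's write is harmless. */
uintptr_t restore_new_scheme(uintptr_t old_mfn, uintptr_t machine_addr)
{
    struct map_op op = { 0 };
    struct page_stub page = { old_mfn };

    hypervisor_complete(&op, machine_addr);

    return page.index;
}
```

In the old scheme the value read back is whatever the hypervisor wrote, not the saved mfn; in the new scheme the saved mfn survives because no field of the hypercall argument is borrowed for storage.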

Konrad Rzeszutek Wilk

2012-Sep-13 13:32 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

> > Sander, could you please let me know if it fixes the problem for you?
> 
> It does !
> 
> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Excellent. Applied. Thx for reporting and testing.

Robert Phillips

2012-Sep-13 13:42 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

In our tree, I have tested Stefano's patch (replacing the
"gnttab_old_mfn" patch which Ben previously provided).
It seems to work just fine.
Thanks, Stefano.

-- rsp

-----Original Message-----
From: Konrad Rzeszutek [mailto:ketuzsezr@gmail.com] On Behalf Of Konrad Rzeszutek Wilk
Sent: Thursday, September 13, 2012 9:32 AM
To: Sander Eikelenboom
Cc: Stefano Stabellini; Robert Phillips; xen-devel@lists.xen.org; Ben Guthro; Konrad Rzeszutek Wilk
Subject: Re: [Xen-devel] dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set
> > Sander, could you please let me know if it fixes the problem for you?
> 
> It does !
> 
> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Excellent. Applied. Thx for reporting and testing.

Conny Seidel

2012-Sep-14 14:53 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Hi,


On Thu, 13 Sep 2012 09:32:14 -0400
Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
>> > Sander, could you please let me know if it fixes the problem for
>> > you?
>>
>> It does !
>>
>> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
>
>Excellent. Applied. Thx for reporting and testing.
Is it possible that this patch is backported to stable?

--
Kind regards.

Conny Seidel

##################################################################
# Email : conny.seidel@amd.com            GnuPG-Key : 0xA6AB055D #
# Fingerprint: 17C4 5DB2 7C4C C1C7 1452 8148 F139 7C09 A6AB 055D #
##################################################################
# Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach      #
# General Managers: Alberto Bozzo                                #
# Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen #
#               HRB Nr. 43632                                    #
##################################################################


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Konrad Rzeszutek Wilk

2012-Sep-14 17:00 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Fri, Sep 14, 2012 at 04:53:33PM +0200, Conny Seidel wrote:
> Hi,
> 
> 
> On Thu, 13 Sep 2012 09:32:14 -0400
> Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
> 
> >> > Sander, could you please let me know if it fixes the problem for
> >> > you?
> >>
> >> It does !
> >>
> >> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
> >
> >Excellent. Applied. Thx for reporting and testing.
> 
> Is it possible that this patch is backported to stable?
It is on the stable release train.

Conny Seidel

2012-Sep-14 17:38 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Fri, 14 Sep 2012 13:00:42 -0400
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>On Fri, Sep 14, 2012 at 04:53:33PM +0200, Conny Seidel wrote:
>> Hi,
>>
>>
>> On Thu, 13 Sep 2012 09:32:14 -0400
>> Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
>>
>> >> > Sander, could you please let me know if it fixes the problem for
>> >> > you?
>> >>
>> >> It does !
>> >>
>> >> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
>> >
>> >Excellent. Applied. Thx for reporting and testing.
>>
>> Is it possible that this patch is backported to stable?
>
>It is on the stable release train.
Thank you, that's nice to know.

--
Kind regards.

Conny Seidel


Sander Eikelenboom

2012-Sep-17 19:14 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

Thursday, September 13, 2012, 3:32:14 PM, you wrote:
>> > Sander, could you please let me know if it fixes the problem for you?
>> 
>> It does !
>> 
>> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
> Excellent. Applied. Thx for reporting and testing.
Hi Konrad,

Could it be that I haven't seen a pull request for this one for the
3.6.0 kernel yet?

--

Sander

Konrad Rzeszutek Wilk

2012-Sep-17 19:23 UTC

Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set

On Mon, Sep 17, 2012 at 09:14:52PM +0200, Sander Eikelenboom wrote:
> Thursday, September 13, 2012, 3:32:14 PM, you wrote:
> 
> >> > Sander, could you please let me know if it fixes the problem for you?
> >> 
> >> It does !
> >> 
> >> Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
> 
> > Excellent. Applied. Thx for reporting and testing.
> 
> Hi Konrad,
> 
> Could it be that I haven't seen a pull request for this one for
> the 3.6.0 kernel yet?
Correct. I am waiting for Andre Przyrwa to give me a heads-up on the AMD
NUMA bugfix so I can push two bug fixes to Linus ASAP.
> 
> --
> 
> Sander
>
