Xu, Dongxiao
2010-Jun-10 11:48 UTC
[Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Hi Jeremy,

The attached patch should fix the PV network issue seen after applying the
netback multiple threads patchset.

This bug comes from the version upgrade when I resent my patchset last time:

 - In patch v1 & v2, I used kzalloc() to allocate the memory, so there was
   no problem.
 - In patch v3, I changed it to "__get_free_pages(GFP_KERNEL | __GFP_ZERO,
   order)", so there was also no problem.
 - However, in patch v4, when I changed it to vmalloc() according to the
   review comments, the memset(0) was missed. Sorry for that.

*ONE POINT IS*, the phenomenon on my side is different from the bug reports
on the mailing list. They often saw long GPLPV network ping latency, while
on my side the system would crash when enabling the guest network.

After applying the patch, the network works fine on my side in the
following cases:

 - WinXP 32bit with gplpv_XP_0.11.0.213.msi
 - 64bit Linux PV guest
 - 64bit RHEL5u3 HVM guest with PV driver

Could somebody try this fix against the long ping latency phenomenon?
I would appreciate it!

Thanks,
Dongxiao
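The attached patch itself is not reproduced in the archive. As an illustration only (the allocation site and sizes are inferred from the surrounding discussion and the later quoted code, not taken from the patch), the fix described above amounts to zeroing the vmalloc()ed group array:

	/* vmalloc(), unlike kzalloc()/__GFP_ZERO, returns uninitialised memory */
	xen_netbk = vmalloc(sizeof(struct xen_netbk) * xen_netbk_group_nr);
	if (!xen_netbk)
		return -ENOMEM;
	/* the step missing in v4: clear the array before anything reads it */
	memset(xen_netbk, 0, sizeof(struct xen_netbk) * xen_netbk_group_nr);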
Jan Beulich
2010-Jun-10 12:29 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
>>> On 10.06.10 at 13:48, "Xu, Dongxiao" <dongxiao.xu@intel.com> wrote:
> Hi Jeremy,
>
> The attached patch should fix the PV network issue after applying the
> netback multiple threads patchset.
>
> This bug comes from the version upgrade when resending my patchset last
> time.
>
> In patch v1 & v2, I used kzalloc() to allocate memory, so there is no
> problem.
> In patch v3, I changed it to "__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> order)", so there is also no problem.
> However in patch v4, when modified it to vmalloc according to the comments,
> memset(0) is missing. Sorry for that.

A perhaps better alternative would be to use

	__vmalloc(..., GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO, PAGE_KERNEL);

Jan
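For comparison (again a sketch against the same assumed allocation site, not a posted patch), Jan's suggestion lets the allocator do the zeroing, so no separate memset() is needed:

	xen_netbk = __vmalloc(sizeof(struct xen_netbk) * xen_netbk_group_nr,
			      GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
			      PAGE_KERNEL);
	if (!xen_netbk)
		return -ENOMEM;
	/* __GFP_ZERO makes __vmalloc() hand back already-zeroed pages */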
<djmagee@mageenet.net>
2010-Jun-10 12:51 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
I can confirm that this patch solves the problem for me. Tested in Win7
x64 with GPLPV 0.11.0.213, and with a PV Linux domain running Fedora 12
64bit.

Doug Magee
Pasi Kärkkäinen
2010-Jun-10 12:58 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, Jun 10, 2010 at 08:51:24AM -0400, djmagee@mageenet.net wrote:
> I can confirm that this patch solves the problem for me. Tested in Win7
> x64 with GPLPV 0.11.0.213, and with a PV Linux domain running Fedora 12
> 64bit.
>

Good to hear. So I guess Jeremy can revert the revert and apply the bugfix patch :)

-- Pasi
Ian Campbell
2010-Jun-11 09:35 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
> Hi Jeremy,
>
> The attached patch should fix the PV network issue after applying the
> netback multiple threads patchset.

Thanks for this Dongxiao. Do you think this crash was a potential symptom
of this issue? It does seem to go away if I apply your patch.

BUG: unable to handle kernel paging request at 70000027
IP: [<c0294867>] make_tx_response+0x17/0xd0
*pdpt = 0000000000000000
Oops: 0000 [#2] SMP
last sysfs file:
Modules linked in:
Supported: Yes

Pid: 1083, comm: netback/0 Tainted: G      D  (2.6.27.45-0.1.1-x86_32p-xen #222)
EIP: 0061:[<c0294867>] EFLAGS: 00010296 CPU: 0
EIP is at make_tx_response+0x17/0xd0
EAX: 6fffffff EBX: 00000000 ECX: 00000000 EDX: f00610a4
ESI: 6fffffff EDI: f00620a4 EBP: ed0c3f18 ESP: ed0c3f0c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: e021
Process netback/0 (pid: 1083, ti=ed0c2000 task=ee9de070 task.ti=ed0c2000)
Stack: 00000000 00000000 f00620a4 ed0c3fa8 c029676a c0456000 ee9de070 ed0c3fd0
       ed0c3f94 00000002 ed0c3fb8 f0062ca4 f0061000 6fffffff 011d9000 f00620a4
       f006108c ed0c3f5c c04ffb00 c04ffb00 ed0c3fc0 ed0c3fbc ed0c3fb8 ed0c2000
Call Trace:
 [<c029676a>] ? net_tx_action+0x32a/0xa50
 [<c0296f62>] ? netbk_action_thread+0x62/0x190
 [<c0296f00>] ? netbk_action_thread+0x0/0x190
 [<c013f84c>] ? kthread+0x3c/0x70
 [<c013f810>] ? kthread+0x0/0x70
 [<c0105633>] ? kernel_thread_helper+0x7/0x10
 ======================
Code: ec 8d 41 01 89 47 2c c7 45 e4 ea ff ff ff eb dd 8d 74 26 00 55 66 0f be c9 89 e5 83 ec 0c 89 74 24 04 89 c6 89 1c 24 89 7c 24 08 <8b> 78 28 8b 40 30 0f b7 5a 08 83 e8 01 21 f8 8d 04 40 c1 e0 02
EIP: [<c0294867>] make_tx_response+0x17/0xd0 SS:ESP e021:ed0c3f0c
---[ end trace f7e370bf10f6f981 ]---

The crash is in one of the calls to list_move_tail, and I think it is
because netbk->pending_inuse_head is not initialised until after the
threads and/or tasklets are created (I was running in threaded mode).
Perhaps even though we are now zeroing the netbk struct, those fields
should still be initialised before kicking off any potentially
asynchronous tasks?

I didn't even start any guests, so I think we only got to the reference
to pending_inuse_head because tx_work_todo can return a false positive
if netbk is not properly zeroed, and therefore we can call net_tx_action
before we are ready.

On an unrelated note, do you have any plans to make the number of groups
react dynamically to CPU hotplug? Not necessarily while there are
actually active VIFs (might be tricky to get right) but perhaps only
when netback is idle (i.e. when there are no VIFs configured), since
often the dynamic adjustment of VCPUs happens at start of day to reduce
the domain 0 VCPU allocation from the total number of cores in the
machine to something more manageable.

Ian.
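An editorial aside on the point above: a zero-filled struct list_head is not the same as one set up with INIT_LIST_HEAD(), because list_empty() compares the head's ->next pointer with the head's own address, so a zeroed head still looks non-empty to the list helpers. The ordering Ian is asking for, sketched with the field and thread names mentioned in this mail (illustrative only, not the eventual patch), would be roughly:

	struct xen_netbk *netbk = &xen_netbk[group];

	/* initialise everything the async code may look at first ... */
	INIT_LIST_HEAD(&netbk->pending_inuse_head);
	INIT_LIST_HEAD(&netbk->net_schedule_list);
	spin_lock_init(&netbk->net_schedule_list_lock);

	/* ... and only then start anything that can run asynchronously */
	netbk->kthread.task = kthread_run(netbk_action_thread, netbk,
					  "netback/%u", group);

This is essentially what the patch later in the thread implements.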
Xu, Dongxiao
2010-Jun-17 08:16 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Ian,

Sorry for the late response, I was on vacation days before.

Ian Campbell wrote:
> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>> Hi Jeremy,
>>
>> The attached patch should fix the PV network issue after applying
>> the netback multiple threads patchset.
>
> Thanks for this Dongxiao. Do you think this crash was a potential
> symptom of this issue? It does seem to go away if I apply your patch.

Actually, the phenomenon is the same on my side without the fixing patch.

> BUG: unable to handle kernel paging request at 70000027
> IP: [<c0294867>] make_tx_response+0x17/0xd0
> [...]
> ---[ end trace f7e370bf10f6f981 ]---
>
> The crash is in one of the calls to list_move_tail, and I think it is
> because netbk->pending_inuse_head is not initialised until after the
> threads and/or tasklets are created (I was running in threaded mode).
> Perhaps even though we are now zeroing the netbk struct, those fields
> should still be initialised before kicking off any potentially
> asynchronous tasks?

You are right, I will commit another patch to fix it.

> I didn't even start any guests, so I think we only got to the reference
> to pending_inuse_head because tx_work_todo can return a false positive
> if netbk is not properly zeroed, and therefore we can call net_tx_action
> before we are ready.
>
> On an unrelated note, do you have any plans to make the number of groups
> react dynamically to CPU hotplug? Not necessarily while there are
> actually active VIFs (might be tricky to get right) but perhaps only
> when netback is idle (i.e. when there are no VIFs configured), since
> often the dynamic adjustment of VCPUs happens at start of day to reduce
> the domain 0 VCPU allocation from the total number of cores in the
> machine to something more manageable.

I'm sorry, currently I am busy with some other tasks and may not have
time to do this job.

But if the case is to reduce the dom0 VCPU number, keeping the group
number unchanged will not impact performance, since the group count
reflects the tasklet/kthread number, and it doesn't have a direct
association with dom0's VCPU number.

Thanks,
Dongxiao

> Ian.
Jeremy Fitzhardinge
2010-Jun-21 11:14 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On 06/17/2010 09:16 AM, Xu, Dongxiao wrote:
> Ian,
>
> Sorry for the late response, I was on vacation days before.
>
> Ian Campbell wrote:
>> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>>> Hi Jeremy,
>>>
>>> The attached patch should fix the PV network issue after applying
>>> the netback multiple threads patchset.
>>
>> Thanks for this Dongxiao. Do you think this crash was a potential
>> symptom of this issue? It does seem to go away if I apply your patch.
>
> Actually, the phenomenon is the same on my side without the fixing patch.
>
>> BUG: unable to handle kernel paging request at 70000027
>> IP: [<c0294867>] make_tx_response+0x17/0xd0
>> [...]
>> ---[ end trace f7e370bf10f6f981 ]---
>>
>> The crash is in one of the calls to list_move_tail, and I think it is
>> because netbk->pending_inuse_head is not initialised until after the
>> threads and/or tasklets are created (I was running in threaded mode).
>> Perhaps even though we are now zeroing the netbk struct, those fields
>> should still be initialised before kicking off any potentially
>> asynchronous tasks?
>
> You are right, I will commit another patch to fix it.

Does this patch address it, or does it need something else?

Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code

Split the netbk group array initialization into two phases: one to do
simple "can't fail" initialization of lists, timers, locks, etc; and a
second pass to allocate memory and start the async code.

This makes sure that all the basic stuff is in place by the time the
async code is running.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index 9a7ada2..95de476 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -1750,8 +1750,10 @@ static int __init netback_init(void)
 	/* We can increase reservation by this much in net_rx_action(). */
 //	balloon_update_driver_allowance(NET_RX_RING_SIZE);
 
+	/* First pass - do simple "can't fail" setup */
 	for (group = 0; group < xen_netbk_group_nr; group++) {
 		struct xen_netbk *netbk = &xen_netbk[group];
+
 		skb_queue_head_init(&netbk->rx_queue);
 		skb_queue_head_init(&netbk->tx_queue);
 
@@ -1764,16 +1766,6 @@ static int __init netback_init(void)
 		netbk->netbk_tx_pending_timer.function =
 			netbk_tx_pending_timeout;
 
-		netbk->mmap_pages =
-			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
-		if (!netbk->mmap_pages) {
-			printk(KERN_ALERT "%s: out of memory\n", __func__);
-			del_timer(&netbk->netbk_tx_pending_timer);
-			del_timer(&netbk->net_timer);
-			rc = -ENOMEM;
-			goto failed_init;
-		}
-
 		for (i = 0; i < MAX_PENDING_REQS; i++) {
 			page = netbk->mmap_pages[i];
 			SetPageForeign(page, netif_page_release);
@@ -1786,6 +1778,26 @@ static int __init netback_init(void)
 		for (i = 0; i < MAX_PENDING_REQS; i++)
 			netbk->pending_ring[i] = i;
 
+		INIT_LIST_HEAD(&netbk->pending_inuse_head);
+		INIT_LIST_HEAD(&netbk->net_schedule_list);
+
+		spin_lock_init(&netbk->net_schedule_list_lock);
+
+		atomic_set(&netbk->netfront_count, 0);
+	}
+
+	/* Second pass - do memory allocation and initialize threads/tasklets */
+	for (group = 0; group < xen_netbk_group_nr; group++) {
+		struct xen_netbk *netbk = &xen_netbk[group];
+
+		netbk->mmap_pages =
+			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
+		if (!netbk->mmap_pages) {
+			printk(KERN_ALERT "%s: out of memory\n", __func__);
+			rc = -ENOMEM;
+			goto failed_init;
+		}
+
 		if (MODPARM_netback_kthread) {
 			init_waitqueue_head(&netbk->kthread.netbk_action_wq);
 			netbk->kthread.task =
@@ -1801,8 +1813,6 @@ static int __init netback_init(void)
 				       "kthread_run() fails at netback\n");
 				free_empty_pages_and_pagevec(netbk->mmap_pages,
 						MAX_PENDING_REQS);
-				del_timer(&netbk->netbk_tx_pending_timer);
-				del_timer(&netbk->net_timer);
 				rc = PTR_ERR(netbk->kthread.task);
 				goto failed_init;
 			}
@@ -1814,13 +1824,6 @@ static int __init netback_init(void)
 				     net_rx_action,
 				     (unsigned long)netbk);
 		}
-
-		INIT_LIST_HEAD(&netbk->pending_inuse_head);
-		INIT_LIST_HEAD(&netbk->net_schedule_list);
-
-		spin_lock_init(&netbk->net_schedule_list_lock);
-
-		atomic_set(&netbk->netfront_count, 0);
 	}
 
 	netbk_copy_skb_mode = NETBK_DONT_COPY_SKB;
@@ -1850,13 +1853,16 @@ static int __init netback_init(void)
 	return 0;
 
 failed_init:
-	for (i = 0; i < group; i++) {
+	for (i = 0; i < xen_netbk_group_nr; i++) {
 		struct xen_netbk *netbk = &xen_netbk[i];
-		free_empty_pages_and_pagevec(netbk->mmap_pages,
-				MAX_PENDING_REQS);
+
 		del_timer(&netbk->netbk_tx_pending_timer);
 		del_timer(&netbk->net_timer);
-		if (MODPARM_netback_kthread)
+
+		if (netbk->mmap_pages)
+			free_empty_pages_and_pagevec(netbk->mmap_pages,
+					MAX_PENDING_REQS);
+		if (MODPARM_netback_kthread && netbk->kthread.task)
 			kthread_stop(netbk->kthread.task);
 	}
 	vfree(xen_netbk);

J
Xu, Dongxiao
2010-Jun-22 02:29 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Jeremy Fitzhardinge wrote:
> On 06/17/2010 09:16 AM, Xu, Dongxiao wrote:
>> Ian,
>>
>> Sorry for the late response, I was on vacation days before.
>>
>> Ian Campbell wrote:
>>> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>>>> Hi Jeremy,
>>>>
>>>> The attached patch should fix the PV network issue after applying
>>>> the netback multiple threads patchset.
>>>
>>> Thanks for this Dongxiao. Do you think this crash was a potential
>>> symptom of this issue? It does seem to go away if I apply your patch.
>>
>> Actually, the phenomenon is the same on my side without the fixing patch.
>>
>>> BUG: unable to handle kernel paging request at 70000027
>>> IP: [<c0294867>] make_tx_response+0x17/0xd0
>>> [...]
>>> ---[ end trace f7e370bf10f6f981 ]---
>>>
>>> The crash is in one of the calls to list_move_tail, and I think it is
>>> because netbk->pending_inuse_head is not initialised until after the
>>> threads and/or tasklets are created (I was running in threaded mode).
>>> Perhaps even though we are now zeroing the netbk struct, those fields
>>> should still be initialised before kicking off any potentially
>>> asynchronous tasks?
>>
>> You are right, I will commit another patch to fix it.
>
> Does this patch address it, or does it need something else?

Hi Jeremy,

The patch is good except that we should move the mmap_pages related code
back to after it is initialized. So I modified your patch a little.

Thanks,
Dongxiao

Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code

Split the netbk group array initialization into two phases: one to do
simple "can't fail" initialization of lists, timers, locks, etc; and a
second pass to allocate memory and start the async code.

This makes sure that all the basic stuff is in place by the time the
async code is running.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
---
 drivers/xen/netback/netback.c |   46 +++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index 9a7ada2..f5d8952 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -1750,8 +1750,10 @@ static int __init netback_init(void)
 	/* We can increase reservation by this much in net_rx_action(). */
 //	balloon_update_driver_allowance(NET_RX_RING_SIZE);
 
+	/* First pass - do simple "can't fail" setup */
 	for (group = 0; group < xen_netbk_group_nr; group++) {
 		struct xen_netbk *netbk = &xen_netbk[group];
+
 		skb_queue_head_init(&netbk->rx_queue);
 		skb_queue_head_init(&netbk->tx_queue);
 
@@ -1764,12 +1766,27 @@ static int __init netback_init(void)
 		netbk->netbk_tx_pending_timer.function =
 			netbk_tx_pending_timeout;
 
+		netbk->pending_cons = 0;
+		netbk->pending_prod = MAX_PENDING_REQS;
+		for (i = 0; i < MAX_PENDING_REQS; i++)
+			netbk->pending_ring[i] = i;
+
+		INIT_LIST_HEAD(&netbk->pending_inuse_head);
+		INIT_LIST_HEAD(&netbk->net_schedule_list);
+
+		spin_lock_init(&netbk->net_schedule_list_lock);
+
+		atomic_set(&netbk->netfront_count, 0);
+	}
+
+	/* Second pass - do memory allocation and initialize threads/tasklets */
+	for (group = 0; group < xen_netbk_group_nr; group++) {
+		struct xen_netbk *netbk = &xen_netbk[group];
+
 		netbk->mmap_pages =
 			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
 		if (!netbk->mmap_pages) {
 			printk(KERN_ALERT "%s: out of memory\n", __func__);
-			del_timer(&netbk->netbk_tx_pending_timer);
-			del_timer(&netbk->net_timer);
 			rc = -ENOMEM;
 			goto failed_init;
 		}
@@ -1781,11 +1798,6 @@ static int __init netback_init(void)
 			INIT_LIST_HEAD(&netbk->pending_inuse[i].list);
 		}
 
-		netbk->pending_cons = 0;
-		netbk->pending_prod = MAX_PENDING_REQS;
-		for (i = 0; i < MAX_PENDING_REQS; i++)
-			netbk->pending_ring[i] = i;
-
 		if (MODPARM_netback_kthread) {
 			init_waitqueue_head(&netbk->kthread.netbk_action_wq);
 			netbk->kthread.task =
@@ -1801,8 +1813,6 @@ static int __init netback_init(void)
 				       "kthread_run() fails at netback\n");
 				free_empty_pages_and_pagevec(netbk->mmap_pages,
 						MAX_PENDING_REQS);
-				del_timer(&netbk->netbk_tx_pending_timer);
-				del_timer(&netbk->net_timer);
 				rc = PTR_ERR(netbk->kthread.task);
 				goto failed_init;
 			}
@@ -1814,13 +1824,6 @@ static int __init netback_init(void)
 				     net_rx_action,
 				     (unsigned long)netbk);
 		}
-
-		INIT_LIST_HEAD(&netbk->pending_inuse_head);
-		INIT_LIST_HEAD(&netbk->net_schedule_list);
-
-		spin_lock_init(&netbk->net_schedule_list_lock);
-
-		atomic_set(&netbk->netfront_count, 0);
 	}
 
 	netbk_copy_skb_mode = NETBK_DONT_COPY_SKB;
@@ -1850,13 +1853,16 @@ static int __init netback_init(void)
 	return 0;
 
 failed_init:
-	for (i = 0; i < group; i++) {
+	for (i = 0; i < xen_netbk_group_nr; i++) {
 		struct xen_netbk *netbk = &xen_netbk[i];
-		free_empty_pages_and_pagevec(netbk->mmap_pages,
-				MAX_PENDING_REQS);
+
 		del_timer(&netbk->netbk_tx_pending_timer);
 		del_timer(&netbk->net_timer);
-		if (MODPARM_netback_kthread)
+
+		if (netbk->mmap_pages)
+			free_empty_pages_and_pagevec(netbk->mmap_pages,
+					MAX_PENDING_REQS);
+		if (MODPARM_netback_kthread && netbk->kthread.task)
 			kthread_stop(netbk->kthread.task);
 	}
 	vfree(xen_netbk);
-- 
1.6.3
Ian Campbell
2010-Jun-24 08:33 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-06-17 at 09:16 +0100, Xu, Dongxiao wrote:
> Ian,
>
> Sorry for the late response, I was on vacation days before.

I was also on vacation, so sorry for _my_ late reply ;-)

> Ian Campbell wrote:
>> On Thu, 2010-06-10 at 12:48 +0100, Xu, Dongxiao wrote:
>>> Hi Jeremy,
>>>
>>> The attached patch should fix the PV network issue after applying
>>> the netback multiple threads patchset.
>>
>> Thanks for this Dongxiao. Do you think this crash was a potential
>> symptom of this issue? It does seem to go away if I apply your patch.
>
> Actually, the phenomenon is the same on my side without the fixing patch.

Great, thanks.

>> On an unrelated note, do you have any plans to make the number of
>> groups react dynamically to CPU hotplug? Not necessarily while there
>> are actually active VIFs (might be tricky to get right) but perhaps
>> only when netback is idle (i.e. when there are no VIFs configured),
>> since often the dynamic adjustment of VCPUs happens at start of day
>> to reduce the domain 0 VCPU allocation from the total number of cores
>> in the machine to something more manageable.
>
> I'm sorry, currently I am busy with some other tasks and may not have
> time to do this job.

I understand.

> But if the case is to reduce the dom0 VCPU number, keeping the group
> number unchanged will not impact performance, since the group count
> reflects the tasklet/kthread number, and it doesn't have a direct
> association with dom0's VCPU number.

Yes, that mitigates the issue to a large degree. I was just concerned
about e.g. 64 threads competing for 4 VCPUs or similar, which seems
wasteful in terms of some resource or other...

For XCP (which may soon switch from 1 to 4 domain 0 VCPUs in the
unstable branch) I've been thinking of the following patch. I wonder if
it might make sense in general? 4 is rather arbitrarily chosen, but I
think even on a 64 core machine you wouldn't want to dedicate more than
some fraction to netback activity, and if you do then it is configurable.

Ian.

netback: allow configuration of maximum number of groups to use

Limit to 4 by default.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

diff -r 7692c6381e1a drivers/xen/netback/netback.c
--- a/drivers/xen/netback/netback.c	Fri Jun 11 08:44:25 2010 +0100
+++ b/drivers/xen/netback/netback.c	Fri Jun 11 09:31:48 2010 +0100
@@ -124,6 +124,10 @@
 static int MODPARM_netback_kthread = 1;
 module_param_named(netback_kthread, MODPARM_netback_kthread, bool, 0);
 MODULE_PARM_DESC(netback_kthread, "Use kernel thread to replace tasklet");
+
+static unsigned int MODPARM_netback_max_groups = 4;
+module_param_named(netback_max_groups, MODPARM_netback_max_groups, uint, 0);
+MODULE_PARM_DESC(netback_max_groups, "Maximum number of netback groups to allocate");
 
 /*
  * Netback bottom half handler.
Xu, Dongxiao
2010-Jun-25 07:31 UTC
RE: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
Ian Campbell wrote:
> On Thu, 2010-06-17 at 09:16 +0100, Xu, Dongxiao wrote:
>> Ian,
>>
>> Sorry for the late response, I was on vacation days before.
>
> I was also on vacation, so sorry for _my_ late reply ;-)
>
> [...]
>
>> But if the case is to reduce the dom0 VCPU number, keeping the group
>> number unchanged will not impact performance, since the group count
>> reflects the tasklet/kthread number, and it doesn't have a direct
>> association with dom0's VCPU number.
>
> Yes, that mitigates the issue to a large degree. I was just concerned
> about e.g. 64 threads competing for 4 VCPUs or similar, which seems
> wasteful in terms of some resource or other...
>
> For XCP (which may soon switch from 1 to 4 domain 0 VCPUs in the
> unstable branch) I've been thinking of the following patch. I wonder
> if it might make sense in general? 4 is rather arbitrarily chosen, but
> I think even on a 64 core machine you wouldn't want to dedicate more
> than some fraction to netback activity, and if you do then it is
> configurable.

Basically I am OK with it. One concern is that, when the system is
equipped with a 10G NIC, according to my previous experience, 4 dom0
vCPUs may not be enough for eating all the bandwidth.

Thanks,
Dongxiao

> Ian.
Ian Campbell
2010-Jul-01 14:48 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Mon, 2010-06-21 at 12:14 +0100, Jeremy Fitzhardinge wrote:
> Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code
>
> Split the netbk group array initialization into two phases: one to do
> simple "can't fail" initialization of lists, timers, locks, etc; and a
> second pass to allocate memory and start the async code.
>
> This makes sure that all the basic stuff is in place by the time the
> async code is running.

Paul noticed a crash in netback init because...

> @@ -1764,16 +1766,6 @@ static int __init netback_init(void)
>  		netbk->netbk_tx_pending_timer.function =
>  			netbk_tx_pending_timeout;
>  
> -		netbk->mmap_pages =
> -			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
> -		if (!netbk->mmap_pages) {
> -			printk(KERN_ALERT "%s: out of memory\n", __func__);
> -			del_timer(&netbk->netbk_tx_pending_timer);
> -			del_timer(&netbk->net_timer);
> -			rc = -ENOMEM;
> -			goto failed_init;
> -		}
> -
>  		for (i = 0; i < MAX_PENDING_REQS; i++) {
>  			page = netbk->mmap_pages[i];
>  			SetPageForeign(page, netif_page_release);

...this dereference of netbk->mmap_pages[i]...

> @@ -1786,6 +1778,26 @@ static int __init netback_init(void)
[...]
> +		netbk->mmap_pages =
> +			alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);

... happens before this initialisation of the array.

Ian.
Jeremy Fitzhardinge
2010-Jul-01 15:29 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On 07/01/2010 04:48 PM, Ian Campbell wrote:
> On Mon, 2010-06-21 at 12:14 +0100, Jeremy Fitzhardinge wrote:
>> Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code
>>
>> Split the netbk group array initialization into two phases: one to do
>> simple "can't fail" initialization of lists, timers, locks, etc; and a
>> second pass to allocate memory and start the async code.
>>
>> This makes sure that all the basic stuff is in place by the time the
>> async code is running.
>
> Paul noticed a crash in netback init because...
>
> [...]
>
> ...this dereference of netbk->mmap_pages[i]...
>
> [...]
>
> ... happens before this initialisation of the array.

Hm, I hadn't meant to commit that properly. I had it locally and
accidentally pushed it out.

I only did that patch as an RFC in response to an issue alluded to by
Dongxiao (or was it you?) about things not being fully initialized by
the time the async code starts. Is this a real issue, and if so, what's
the correct fix?

Thanks,
J
Ian Campbell
2010-Jul-01 15:47 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-07-01 at 16:29 +0100, Jeremy Fitzhardinge wrote:
> On 07/01/2010 04:48 PM, Ian Campbell wrote:
>> On Mon, 2010-06-21 at 12:14 +0100, Jeremy Fitzhardinge wrote:
>>> Subject: [PATCH] xen/netback: make sure all the group structures are initialized before starting async code
>>>
>>> [...]
>>
>> Paul noticed a crash in netback init because...
>>
>> [...]
>>
>> ... happens before this initialisation of the array.
>
> Hm, I hadn't meant to commit that properly. I had it locally and
> accidentally pushed it out.
>
> I only did that patch as an RFC in response to an issue alluded to by
> Dongxiao (or was it you?) about things not being fully initialized by
> the time the async code starts. Is this a real issue, and if so, what's
> the correct fix?

I don't think there is an actual current issue, just a potential one:
since we are relying on data structures being zeroed rather than
properly initialised to keep the async code from running off into the
weeds, it just seemed a little fragile this way.

Originally I said:

>> The crash is in one of the calls to list_move_tail, and I think it is
>> because netbk->pending_inuse_head is not initialised until after the
>> threads and/or tasklets are created (I was running in threaded mode).
>> Perhaps even though we are now zeroing the netbk struct, those fields
>> should still be initialised before kicking off any potentially
>> asynchronous tasks?

This specific issue was fixed by zeroing the netbk array as it is
allocated; I just thought we could make things more robust by not
triggering the async code until everything was fully set up.

Ian.
Ian Campbell
2010-Jul-01 16:06 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-07-01 at 16:47 +0100, Ian Campbell wrote:
> On Thu, 2010-07-01 at 16:29 +0100, Jeremy Fitzhardinge wrote:
>> I only did that patch as an RFC in response to an issue alluded to by
>> Dongxiao (or was it you?) about things not being fully initialized by
>> the time the async code starts. Is this a real issue, and if so, what's
>> the correct fix?
>
> I don't think there is an actual current issue, just a potential one:
> since we are relying on data structures being zeroed rather than
> properly initialised to keep the async code from running off into the
> weeds, it just seemed a little fragile this way.
>
> [...]
>
> This specific issue was fixed by zeroing the netbk array as it is
> allocated; I just thought we could make things more robust by not
> triggering the async code until everything was fully set up.

I suspect simply not letting the kthread off the leash until the end of
the loop may be sufficient. The tasklet case is, I think, safe until the
tasklets are explicitly tickled. E.g. something like:

diff --git a/drivers/xen/netback/netback.c b/drivers/xen/netback/netback.c
index 29b60c8..36f3cc7 100644
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ -1794,7 +1794,6 @@
 
 		if (!IS_ERR(netbk->kthread.task)) {
 			kthread_bind(netbk->kthread.task, group);
-			wake_up_process(netbk->kthread.task);
 		} else {
 			printk(KERN_ALERT
 				"kthread_run() fails at netback\n");
@@ -1820,6 +1819,9 @@
 		spin_lock_init(&netbk->net_schedule_list_lock);
 
 		atomic_set(&netbk->netfront_count, 0);
+
+		if (MODPARM_netback_kthread)
+			wake_up_process(netbk->kthread.task);
 	}
 
 	netbk_copy_skb_mode = NETBK_DONT_COPY_SKB;
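As background for the patch above (an editorial note, not part of Ian's mail): kthread_run() is simply kthread_create() followed by wake_up_process(), so a newly created kthread stays parked until it is explicitly woken, and deferring only the wake-up keeps netbk_action_thread from touching any per-group state before the loop has finished initialising it. In general form (illustrative sketch; threadfn and data are placeholders):

	struct task_struct *task;

	task = kthread_create(threadfn, data, "netback/%u", group);
	if (IS_ERR(task))
		return PTR_ERR(task);
	kthread_bind(task, group);	/* pin to its CPU while still asleep */

	/* ... finish initialising all state the thread will use ... */

	wake_up_process(task);		/* only now does threadfn() start running */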
Jeremy Fitzhardinge
2010-Jul-01 16:07 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On 07/01/2010 05:47 PM, Ian Campbell wrote:
>> Hm, I hadn't meant to commit that properly. I had it locally and
>> accidentally pushed it out.
>>
>> I only did that patch as an RFC in response to an issue alluded to by
>> Dongxiao (or was it you?) about things not being fully initialized by
>> the time the async code starts. Is this a real issue, and if so, what's
>> the correct fix?
>
> I don't think there is an actual current issue, just a potential one:
> since we are relying on data structures being zeroed rather than
> properly initialised to keep the async code from running off into the
> weeds, it just seemed a little fragile this way.
>
> Originally I said:
>
>>> The crash is in one of the calls to list_move_tail, and I think it is
>>> because netbk->pending_inuse_head is not initialised until after the
>>> threads and/or tasklets are created (I was running in threaded mode).
>>> Perhaps even though we are now zeroing the netbk struct, those fields
>>> should still be initialised before kicking off any potentially
>>> asynchronous tasks?
>
> This specific issue was fixed by zeroing the netbk array as it is
> allocated; I just thought we could make things more robust by not
> triggering the async code until everything was fully set up.

It would only affect system startup time, not domain creation? I was
looking at it because Stefano was having fairly consistent crashes on
domain creation, and it looked like sort-of-racy symptoms.

J
Ian Campbell
2010-Jul-01 16:11 UTC
Re: [Xen-devel][PV-ops][PATCH] Netback: Fix PV network issue for netback multiple threads patchset
On Thu, 2010-07-01 at 17:07 +0100, Jeremy Fitzhardinge wrote:
> On 07/01/2010 05:47 PM, Ian Campbell wrote:
>> I don't think there is an actual current issue, just a potential one:
>> since we are relying on data structures being zeroed rather than
>> properly initialised to keep the async code from running off into the
>> weeds, it just seemed a little fragile this way.
>>
>> [...]
>>
>> This specific issue was fixed by zeroing the netbk array as it is
>> allocated; I just thought we could make things more robust by not
>> triggering the async code until everything was fully set up.
>
> It would only affect system startup time, not domain creation?

I think so, it's a race during netback_init. netbk_action_thread can
reference &netbk->net_schedule_list (via tx_work_todo) before it is
initialised. It is now zeroed, which is probably safe but not exactly
robust.

> I was looking at it because Stefano was having fairly consistent crashes
> on domain creation, and it looked like sort-of-racy symptoms.

I think this particular race would be long gone by the time you started
a domain. Although I suppose having the thread spinning, doing
potentially arbitrary things for a small window during netback_init,
could leave things in a fragile enough state that it might fall apart
when you started a domain. I'm not convinced, though.

Ian.